Abstract 

We consider the problem of learning a dictionary matrix from a number of observed signals, which are assumed to be 
generated via a linear model with a common underlying dictionary. In particular, we derive lower bounds on the minimum 
achievable worst case mean squared error (MSE), regardless of computational complexity of the dictionary learning (DL) 
schemes. By casting DL as a classical (or frequentist) estimation problem, the lower bounds on the worst case MSE are 
derived by following an established information-theoretic approach to minimax estimation. The main conceptual contribution 
of this paper is the adaptation of the information-theoretic approach to minimax estimation to the DL problem in order to 
derive lower bounds on the worst case MSE of any DL scheme. We derive three different lower bounds applying to different 
generative models for the observed signals. The first bound applies to a wide range of models; it only requires the existence 
of a covariance matrix of the (unknown) underlying coefficient vector. By specializing this bound to the case of sparse 
coefficient distributions, and assuming the true dictionary satisfies the restricted isometry property, we obtain a lower bound 
on the worst case MSE of DL schemes in terms of a signal to noise ratio (SNR). The third bound applies to a more restrictive 
subclass of coefficient distributions by requiring the non-zero coefficients to be Gaussian. Although the applicability of this 
final bound is the most limited of the three, it is the tightest of the three bounds in the low SNR 
regime. A particular use of our lower bounds is the derivation of necessary conditions on the required number of observations 
(sample size) such that DL is feasible, i.e., accurate DL schemes might exist. By comparing these necessary conditions with 
sufficient conditions on the sample size such that a particular DL scheme is successful, we are able to characterize the regimes 
where those algorithms are optimal (or possibly not) in terms of required sample size. 

Index Terms 

Compressed Sensing, Dictionary Learning, Minimax Risk, Fano Inequality. 


I. Introduction 

According to [1], the worldwide internet traffic in 2016 will exceed the Zettabyte threshold.¹ In view of the pervasive 
massive datasets generated at an ever increasing speed [2], [3], it is mandatory to be able to extract relevant information 
out of the observed data. A recent approach to this challenge, which has proven extremely useful for a wide range of 
applications, is sparsity and the related theory of compressed sensing (CS) [4]–[6]. In our context, sparsity means that the 
observed signals can be represented by a linear combination of a small number of prototype functions or atoms. In many 
applications the set of atoms is pre-specified and stored in a dictionary matrix. However, in some applications it might 
be necessary or beneficial to adaptively determine a dictionary based on the observations [7]–[9]. The task of adaptively 
determining the underlying dictionary matrix is referred to as dictionary learning (DL). DL has been considered for a 
wide range of applications, such as image processing [10]–[14], blind source separation [15], sparse principal component 
analysis [16], and more. 

In this paper, we consider observing N signals $\mathbf{y}_k \in \mathbb{R}^m$ generated via a fixed (but unknown) underlying dictionary 
$\mathbf{D} \in \mathbb{R}^{m \times p}$ (which we would like to estimate). More precisely, the observations are modeled as noisy linear combinations 

$$\mathbf{y}_k = \mathbf{D}\mathbf{x}_k + \mathbf{n}_k, \tag{1}$$

where the noise $\mathbf{n}_k$ is assumed to be zero-mean with i.i.d. components of variance $\sigma^2$. To formalize the estimation problem underlying 
DL, we assume the coefficient vectors $\mathbf{x}_k$ to be zero-mean random vectors with finite covariance matrix $\boldsymbol{\Sigma}_x$. We highlight 

Parts of this work were previously presented at the 22nd European Signal Processing Conference, Lisbon, PT, Sept. 2014. 

¹One Zettabyte equals $10^{21}$ bytes. 




that our first main result, i.e., Theorem III.1, applies to a very wide class of coefficient distributions since it only requires 
a finite covariance matrix $\boldsymbol{\Sigma}_x$. In particular, Theorem III.1 also applies to non-sparse random coefficient vectors. However, 
the main focus of our paper (in particular, for Corollary III.2 and Theorem III.3) will be on distributions such that the 
coefficient vector is strictly s-sparse with probability one. In this work, we analyze the difficulty inherent to the problem 
of estimating the true dictionary $\mathbf{D} \in \mathbb{R}^{m \times p}$, which is deterministic but unknown, from the measurements $\mathbf{y}_k$, which are 
generated according to the linear model (1). 

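If we stack the observations $\mathbf{y}_k$, for $k = 1, \ldots, N$, column-wise into the data matrix $\mathbf{Y} \in \mathbb{R}^{m \times N}$, one can cast DL as 
a matrix factorization problem [17]. Given the data matrix $\mathbf{Y}$, we aim to find a dictionary matrix $\mathbf{D} \in \mathbb{R}^{m \times p}$ such that 

$$\mathbf{Y} = \mathbf{D}\mathbf{X} + \mathbf{N}, \tag{2}$$

where the column-sparse matrix $\mathbf{X} \in \mathbb{R}^{p \times N}$ contains in its kth column the sparse expansion coefficients $\mathbf{x}_k$ of the signal 
$\mathbf{y}_k$. The noise matrix $\mathbf{N} = (\mathbf{n}_1, \ldots, \mathbf{n}_N) \in \mathbb{R}^{m \times N}$ accounts for small modeling and measurement errors. 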
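
The factorization model (2) can be made concrete with a short simulation. The following Python sketch draws synthetic data under assumptions that the text has not fixed at this point: a Gaussian dictionary with columns scaled to unit norm, supports chosen uniformly at random, Gaussian non-zero coefficients, and Gaussian noise. The function name `generate_observations` and its parameters are purely illustrative.

```python
import numpy as np

def generate_observations(m, p, s, N, sigma_a=1.0, sigma=0.1, seed=0):
    """Draw Y = D X + noise with an s-sparse coefficient matrix X (illustrative)."""
    rng = np.random.default_rng(seed)
    # Assumption: random dictionary with unit-norm columns.
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    # Assumption: each column of X has s Gaussian non-zeros on a random support.
    X = np.zeros((p, N))
    for k in range(N):
        support = rng.choice(p, size=s, replace=False)
        X[support, k] = sigma_a * rng.standard_normal(s)
    # Assumption: white Gaussian noise with standard deviation sigma.
    noise = sigma * rng.standard_normal((m, N))
    return D @ X + noise, D, X

Y, D, X = generate_observations(m=8, p=16, s=3, N=50)
```

A DL scheme only sees `Y`; the pair `(D, X)` is returned here so that an estimate can be compared against the generating dictionary.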

a) Prior Art: A plethora of DL methods have been proposed and analyzed in the literature (e.g., [7], [18]–[26]). 
In a Bayesian setting, i.e., modeling the dictionary as random with a known prior distribution, the authors of [?], [?], 
[27] apply a variant of the approximate message passing scheme [28] to the DL problem. The authors of [19]–[22], [?] 
model the dictionary as non-random and estimate the dictionary by solving the (non-convex) optimization problem 

$$\min_{\mathbf{D} \in \mathcal{D},\, \mathbf{X} \in \mathbb{R}^{p \times N}} \|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2 + \lambda \|\mathbf{X}\|_1, \tag{3}$$

where $\|\mathbf{X}\|_1 \triangleq \sum_{k,j} |x_{k,j}|$ and $\mathcal{D} \subseteq \mathbb{R}^{m \times p}$ denotes a constraint set, e.g., requiring the columns of the learned dictionary 
to have unit norm. The term $\lambda \|\mathbf{X}\|_1$ (with sufficiently large $\lambda$) in the objective (3) enforces the columns of the coefficient 
matrix $\mathbf{X}$ to be (approximately) sparse. 
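
To make the cost in (3) tangible, the following Python sketch evaluates the objective for a given pair of dictionary and coefficient matrix. It is not a solver; the helper names `dl_objective` and `has_unit_norm_columns` are hypothetical, and the unit-norm check is just one example of a constraint set.

```python
import numpy as np

def dl_objective(Y, D, X, lam):
    """Value of the l1-regularized least-squares cost in (3)."""
    residual = Y - D @ X
    data_fit = np.sum(residual ** 2)        # squared Frobenius norm
    sparsity = lam * np.sum(np.abs(X))      # entrywise l1 norm of X
    return data_fit + sparsity

def has_unit_norm_columns(D, tol=1e-10):
    """One possible constraint set: dictionaries with unit-norm columns."""
    return bool(np.all(np.abs(np.linalg.norm(D, axis=0) - 1.0) < tol))

Y = np.array([[1.0, 0.0], [0.0, -2.0]])
D = np.eye(2)
value = dl_objective(Y, D, X=Y, lam=0.5)   # perfect fit, so only the penalty remains
```

With a perfect fit the data term vanishes and the value equals the penalty alone, which illustrates how a larger regularization weight pushes the minimizer toward sparser coefficient matrices.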

Assuming the true dictionary $\mathbf{D} \in \mathbb{R}^{m \times p}$ to be deterministic but unknown (its size p, however, is known) and the observations 
$\mathbf{y}_k$ to be i.i.d. according to the model (1), the authors of [?]–[?] provide upper bounds on the distance between the 
generating dictionary $\mathbf{D}$ and the closest local minimum of (3). For the square (i.e., p = m) and noiseless ($\mathbf{N} = \mathbf{0}$) setting, 
[21] showed that $N = O(p \log(p))$ observations suffice to guarantee that the dictionary is a local minimum of (3). Using 
the same setting (square dictionary and noiseless measurements), [25] proved the scaling $N = O(p \log(p))$, for arbitrary 
sparsity level, to be actually sufficient such that the dictionary matrix can be recovered perfectly from the measurements 
$\mathbf{y}_k$.² Our analysis, in contrast, takes measurement noise into account and yields lower bounds on the required sample size 
in terms of SNR. While the results on the square-dictionary and noiseless case are theoretically important, their practical 
relevance is limited. Considering the practically more relevant case of an overcomplete (p > m) dictionary $\mathbf{D}$ and noisy 
measurements ($\mathbf{N} \neq \mathbf{0}$), the authors of [20] show that a sample size of $N = O(p^3 m)$ i.i.d. measurements $\mathbf{y}_k$ suffices for 
the existence of a local minimum of the cost function in (3) which is close to the true dictionary $\mathbf{D}$. 

In contrast to methods based on solving (3), a recent line of work [?], [?], [?] presents DL methods based on 
(graph-)clustering techniques. In particular, the set of observed samples $\mathbf{y}_k$ is clustered such that the elements within each 
cluster share a single generating column $\mathbf{d}_j$ of the underlying dictionary. The authors of [26] show that a sample size 
$N = O(p^2 \log p)$ suffices for their clustering-based method to accurately recover the true underlying dictionary. However, 
this result applies only for sufficiently incoherent dictionaries $\mathbf{D}$ and for the case of vanishing sparsity rate, i.e., $s/p \to 0$. 
The scaling of the required sample size with the square of the number p of dictionary columns (neglecting logarithmic 
terms) is also predicted by our bounds. What sets our work apart from [26] is that we state our results in a non-asymptotic 
setting, i.e., our bounds can be evaluated for any given number p of dictionary atoms, dimension m of observed signals 
and nominal sparsity level s. 

Although numerous DL schemes have been proposed and analyzed, existing analyses typically yield sufficient conditions 
(e.g., on the sample size N) such that DL is feasible. In contrast, necessary conditions which apply to any DL scheme 
(irrespective of computational complexity) are far scarcer. We are only aware of a single fundamental result that 
applies to a Bernoulli-Gauss prior for the coefficient vectors $\mathbf{x}_k$ in (1). This result, also known as the “coupon collector 

²With high probability and up to scaling and permutations of the dictionary columns. 












phenomenon” [25], states that in order to have every column $\mathbf{d}_j$ of the dictionary contributing to at least one observed 
signal (i.e., the corresponding entry $x_{k,j}$ of the coefficient vector in (1) is non-zero), the sample size has to scale linearly 
with $(1/\theta) \log p$, where $\theta$ denotes the probability $P\{x_{k,j} \neq 0\}$. For the choice $\theta = s/p$, which yields s-sparse coefficient 
vectors with high probability, this requirement effectively becomes $N \geq c_1 (p/s) \log p$, with some absolute constant $c_1$. 
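
The coupon collector scaling can be checked with a few lines of arithmetic. The Python sketch below assumes a Bernoulli model in which each coefficient is non-zero independently with probability θ, and it sets the absolute constant to one purely for illustration; both function names are hypothetical.

```python
import math

def coupon_collector_sample_size(p, theta):
    """Sample size (1/theta) * log(p), rounded up to an integer."""
    return math.ceil(math.log(p) / theta)

def expected_unused_atoms(p, theta, N):
    """Expected number of atoms never active in N samples,
    assuming each entry is non-zero independently with probability theta."""
    return p * (1.0 - theta) ** N

p, s = 100, 5
theta = s / p
N = coupon_collector_sample_size(p, theta)
unused = expected_unused_atoms(p, theta, N)
```

Since $(1-\theta)^N \le e^{-\theta N}$, choosing N at least $(1/\theta)\log p$ brings the expected number of never-used atoms down to at most one, while a substantially smaller sample size leaves many atoms unobserved.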

b) Contribution: In this paper we contribute to the understanding of necessary conditions or fundamental recovery 
thresholds for DL, by deriving lower bounds on the minimax risk for the DL problem. We define the risk incurred by a DL 
scheme as the mean squared error (MSE) using the Frobenius norm of the deviation from the true underlying dictionary. 
Since the minimax risk is defined as the minimum achievable worst case MSE, our lower bounds apply to the worst case 
MSE of any algorithm, regardless of its computational complexity. This paper seems to contain the first analysis that 
targets directly the fundamental limits on the achievable MSE of any DL method. 

For the derivation of the lower bounds, we apply an established information-theoretic approach (cf. Section II) to 
minimax estimation, which is based on reducing a specific multiple hypothesis problem to minimax estimation of the 
dictionary matrix. Although this information-theoretic approach has been successfully applied to several other (sparse) 
minimax estimation problems [29]–[34], the adaptation of this method to the problem of DL seems to be new. The lower 
bounds on the minimax risk give insight into the dependencies of the achievable worst case MSE on the model parameters, 
i.e., the sparsity s, the dictionary size p, the dimension m of the observed signal and the SNR. Our lower bounds on the 
minimax risk have direct implications on the required sample size of accurate DL schemes. In particular, our analysis 
reveals that, for a sufficiently incoherent underlying dictionary, the minimax risk of DL is lower bounded by $c_1 p^2/(\mathrm{SNR}\, N)$, 
where $c_1$ is some absolute constant. Thus, for a vanishing minimax risk it is necessary for the sample size N to scale 
linearly with the square of the number p of dictionary columns and inversely with the SNR. Finally, by comparing our 
lower bounds (on minimax risk and sample size) with the performance guarantees of existing learning schemes, we can 
test if these methods perform close to optimal. 

A recent work on the sample complexity of dictionary learning [35] presented upper bounds on the sample size such 
that the (expected) performance of an ideal learning scheme is close to its empirical performance observed when applied 
to the observed samples. While the authors of [35] measure the quality of the estimate $\widehat{\mathbf{D}}$ via the residual error obtained 
when sparsely approximating the observed vectors $\mathbf{y}_k$, we use a different risk measure based on the squared Frobenius 
norm of the deviation from the true underlying dictionary. Clearly, these two risk measures are related. Indeed, if the 
Frobenius norm $\|\widehat{\mathbf{D}} - \mathbf{D}\|_F$ is small, we can also expect that any sparse linear combination $\mathbf{D}\mathbf{x}$ using the dictionary $\mathbf{D}$ 
can also be well represented by a sparse linear combination $\widehat{\mathbf{D}}\mathbf{x}'$ using $\widehat{\mathbf{D}}$. Our results are somewhat complementary to 
the upper bounds in [35] in that they yield lower bounds on the required sample size such that there may exist accurate 
learning schemes (regardless of computational complexity). 

The remainder of this paper is organized as follows. We introduce the minimax risk of DL and the information-theoretic 
method for lower bounding it in Section II. Lower bounds on the minimax risk for DL are presented in Section III. We 
also put our bounds into perspective by comparing their implications to the available performance guarantees of some DL 
schemes. Detailed proofs of the main results are contained in Section IV. 

Throughout the paper, we use the following notation: Given a natural number $k \in \mathbb{N}$, we define the set $[k] = \{1, \ldots, k\}$. 
For a matrix $\mathbf{A} \in \mathbb{R}^{m \times p}$, we denote its Frobenius norm and its spectral norm by $\|\mathbf{A}\|_F = \sqrt{\mathrm{Tr}\{\mathbf{A}\mathbf{A}^T\}}$ and $\|\mathbf{A}\|_2$, 
respectively. The open (Frobenius-norm) ball of radius r > 0 and center $\mathbf{D} \in \mathbb{R}^{m \times p}$ is denoted 
$\mathcal{B}(\mathbf{D}, r) = \{\mathbf{D}' \in \mathbb{R}^{m \times p} : \|\mathbf{D} - \mathbf{D}'\|_F < r\}$. For a square matrix $\mathbf{A}$, the vector containing the elements along the diagonal of $\mathbf{A}$ is denoted 
diag{A}. Analogously, given a vector $\mathbf{a}$, we denote by diag{a} the diagonal matrix whose diagonal is obtained from 
$\mathbf{a}$. The kth column of the identity matrix is denoted $\mathbf{e}_k$. For a matrix $\mathbf{X} \in \mathbb{R}^{p \times N}$, we denote by supp(X) the N-tuple 
$(\mathrm{supp}(\mathbf{x}_1), \ldots, \mathrm{supp}(\mathbf{x}_N))$ of subsets of [p] given by concatenating the supports $\mathrm{supp}(\mathbf{x}_k)$ of the columns $\mathbf{x}_k$ of the matrix 
$\mathbf{X}$. The complementary Kronecker delta is denoted $\bar{\delta}_{l,l'}$, i.e., $\bar{\delta}_{l,l'} = 0$ if $l = l'$ and equal to one otherwise. We denote 
by $\mathbf{0}$ the vector or matrix with all entries equal to 0. The determinant of a square matrix $\mathbf{C}$ is denoted $|\mathbf{C}|$. The identity 
matrix is written as $\mathbf{I}$, or $\mathbf{I}_d$ when the dimension d × d is not clear from the context. Given a positive semidefinite (psd) 
matrix $\mathbf{C}$, we write its smallest eigenvalue as $\lambda_{\min}(\mathbf{C})$. The natural and binary logarithm of a number b are denoted $\log(b)$ 
and $\log_2(b)$, respectively. For two sequences $g(N)$ and $f(N)$, indexed by the natural number N, we write $g = O(f)$ and 
$g = \Theta(f)$ if, respectively, $g(N) \leq C' f(N)$ and $g(N) \geq C'' f(N)$ for some constants $C', C'' > 0$. If $g(N)/f(N) \to 0$, 
we write $g = o(f)$. We denote by $E_{\mathbf{X}} f(\mathbf{X})$ the expectation of the function $f(\mathbf{X})$ of the random vector (or matrix) $\mathbf{X}$. 

II. Problem Formulation 

A. Basic Setup 

For our analysis we assume the observations are i.i.d. realizations according to the random linear model 

$$\mathbf{y} = \mathbf{D}\mathbf{x} + \mathbf{n}. \tag{4}$$

Thus, the vectors $\mathbf{y}_k$, $\mathbf{x}_k$ and $\mathbf{n}_k$, for $k = 1, \ldots, N$, in (1) are i.i.d. realizations of the random vectors $\mathbf{y}$, $\mathbf{x}$ and $\mathbf{n}$ in (4). 
Here, the matrix $\mathbf{D} \in \mathbb{R}^{m \times p}$, with $p \geq m$, represents the deterministic but unknown underlying dictionary, whose columns 
are the building blocks of the observed signals $\mathbf{y}_k$. The vector $\mathbf{x}$ represents zero mean random expansion coefficients, 
whose distribution is assumed to be known. Our analysis applies to a wide class of distributions. In fact, we only require 
the existence of the covariance matrix 

$$\boldsymbol{\Sigma}_x \triangleq E_{\mathbf{x}}\{\mathbf{x}\mathbf{x}^T\}. \tag{5}$$

The effect of modeling and measurement errors is captured by the noise vector $\mathbf{n}$, which is assumed independent of $\mathbf{x}$ 
and is additive white Gaussian noise (AWGN) with zero mean and known variance $\sigma^2$. When combined with a sparsity enhancing 
prior on $\mathbf{x}$, the linear model (4) reduces to the sparse linear model (SLM) [36], which is the workhorse of CS [6], [37], 
[38]. However, while the works on the SLM typically assume the dictionary $\mathbf{D}$ in (4) to be perfectly known, we consider the 
situation where $\mathbf{D}$ is unknown. 

In what follows, we assume the columns of the dictionary $\mathbf{D}$ to be normalized, i.e., 

$$\mathbf{D} \in \mathcal{D} \triangleq \{\mathbf{B} \in \mathbb{R}^{m \times p} \mid \mathbf{e}_k^T \mathbf{B}^T \mathbf{B} \mathbf{e}_k = 1, \text{ for all } k \in [p]\}. \tag{6}$$

The set $\mathcal{D}$ is known as the oblique manifold [20], [39], [40]. For fixed problem dimensions p, m and s, requiring (6) 
effectively amounts to identifying SNR with the quantity $\|\boldsymbol{\Sigma}_x\|_2/\sigma^2$. Our analysis is local in the sense that we consider 
the true dictionary $\mathbf{D}$ to belong to a small neighborhood, i.e., 

$$\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r) \triangleq \mathcal{B}(\mathbf{D}_0, r) \cap \mathcal{D} = \{\mathbf{D}' \in \mathcal{D} : \|\mathbf{D}' - \mathbf{D}_0\|_F < r\} \tag{7}$$

with a fixed and known “reference dictionary” $\mathbf{D}_0 \in \mathcal{D}$ and known radius³ $r \leq 2\sqrt{p}$. This local analysis avoids ambiguity 
issues (which we discuss below) that are intrinsic to DL. However, the lower bounds on the minimax risk derived under the 
locality constraint (7) trivially also apply to the global DL problem, i.e., where we only require (6). 
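
The constraints (6) and (7) translate directly into two small numerical checks. The Python sketch below normalizes the columns of an arbitrary matrix and tests membership in the local neighborhood; the helper names `project_to_oblique` and `in_local_neighborhood` are hypothetical, and column normalization is used here only as a convenient way to produce an element of the oblique manifold.

```python
import numpy as np

def project_to_oblique(B):
    """Scale every column of B to unit Euclidean norm (assumes no zero column)."""
    return B / np.linalg.norm(B, axis=0, keepdims=True)

def in_local_neighborhood(D, D0, r):
    """Check D in X(D0, r): unit-norm columns and ||D - D0||_F < r."""
    unit_cols = np.allclose(np.linalg.norm(D, axis=0), 1.0)
    return bool(unit_cols and np.linalg.norm(D - D0, "fro") < r)

rng = np.random.default_rng(0)
m, p = 6, 10
D0 = project_to_oblique(rng.standard_normal((m, p)))
D = project_to_oblique(D0 + 0.05 * rng.standard_normal((m, p)))
distance = np.linalg.norm(D - D0, "fro")
```

Every matrix with unit-norm columns has Frobenius norm $\sqrt{p}$, so by the triangle inequality two such dictionaries are never more than $2\sqrt{p}$ apart, which is why larger radii add nothing.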

B. The minimax risk 

We will investigate the fundamental limits on the accuracy achievable by any DL scheme, irrespective of its computational 
complexity. By a DL scheme, we mean an estimator $\widehat{\mathbf{D}}(\cdot)$ which maps the observation $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$ to an estimate 
$\widehat{\mathbf{D}}(\mathbf{Y})$ of the true underlying dictionary $\mathbf{D}$. The accuracy of a given learning method will be measured via the MSE 
$E_{\mathbf{Y}}\{\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\}$, which is the expected squared distance of the estimate $\widehat{\mathbf{D}}(\mathbf{Y})$ from the true dictionary, measured in 
Frobenius norm. Note that the MSE of a given learning scheme $\widehat{\mathbf{D}}(\mathbf{Y})$ depends on the true underlying dictionary $\mathbf{D}$, which 
is fixed but unknown. Therefore, the MSE cannot be minimized uniformly for all $\mathbf{D}$ [41]. However, for a given estimator 
$\widehat{\mathbf{D}}(\cdot)$, a reasonable performance measure is the worst case MSE $\sup_{\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)} E_{\mathbf{Y}}\{\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\}$ [42]. The optimum 
estimator under this criterion has smallest worst case MSE among all possible estimators. This smallest worst case MSE 
(referred to as minimax risk) is an intrinsic property of the estimation problem and does not depend on a specific estimator. 
Let us highlight that the minimax risk is defined here for a fixed and known distribution of the coefficient vector $\mathbf{x}_k$ in (1). 

³Considering only values not exceeding $2\sqrt{p}$ for the radius r in (7) is reasonable since for any radius $r > 2\sqrt{p}$ we would obtain $\mathcal{X}(\mathbf{D}_0, r) = \mathcal{D}$, 
yielding the global DL problem. 

In what follows, we derive three different lower bounds on the minimax risk by considering different types of coefficient 
distributions. 

Concretely, the minimax risk $\varepsilon^*$ for the problem of learning the dictionary $\mathbf{D}$ based on N i.i.d. 
observations $\mathbf{y}_k$, distributed according to the model (1), is 

$$\varepsilon^* \triangleq \inf_{\widehat{\mathbf{D}}} \sup_{\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)} E_{\mathbf{Y}}\{\|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\}. \tag{8}$$

In general, the minimax risk $\varepsilon^*$ depends on the sample size N, the dimension m of the observed signals, the number p 
of dictionary elements, the sparsity degree s and the noise variance $\sigma^2$. For the sake of light notation, we will not make 
this dependence explicit. 

Note that while, at first sight, the locality assumption (7) may suggest that our analysis yields weaker results than for 
the case of not having this locality assumption, the opposite is actually true. Indeed, our lower bounds on the minimax risk 
predict that even under the additional a-priori knowledge that the true dictionary belongs to the (small) neighborhood of a 
known reference dictionary $\mathbf{D}_0$, the minimax risk is lower bounded by a strictly positive number which, for a sufficiently 
large sample size, does not depend on the size of the neighborhood at all. Also, from the definition (8) it is obvious that 
any lower bound on the minimax risk $\varepsilon^*$ under the locality constraint (7) is simultaneously a lower bound on the minimax 
risk for global DL, which is obtained from (8) by replacing the constraint $\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)$ in the inner maximization with 
the constraint $\mathbf{D} \in \mathcal{D}$. 

The minimax problem (8) typically cannot be solved in closed form. Instead of trying to exactly solve (8) and determine 
$\varepsilon^*$, we will derive lower bounds on $\varepsilon^*$ by adapting an established information-theoretic methodology (cf., e.g., [30], [?], 
[43]) to the DL problem. Having a lower bound on the minimax risk $\varepsilon^*$ allows us to assess the performance of a given DL 
scheme. In particular, if the worst case MSE of a given scheme is close to the lower bound, then there is no point in 
searching for alternative schemes with substantially better performance. Let us highlight that our bounds apply to any DL 
scheme, regardless of its computational complexity. In particular, these bounds apply also to DL methods which exploit 
neither the knowledge of the sparse coefficient distribution nor that of the noise variance. 

C. Information-theoretic lower bounds on the minimax risk 

A principled approach [30], [?], [43] to lower bounding the minimax risk $\varepsilon^*$ of a general estimation problem is based 
on reducing a specific multiple hypothesis testing problem to minimax estimation of the dictionary $\mathbf{D}$. More precisely, if 
there exists an estimator with small worst case MSE, then this estimator can be used to solve a hypothesis testing problem. 
However, using Fano’s inequality, there is a fundamental limit on the error probability for the hypothesis testing problem. 
This limit induces a lower bound on the worst case MSE of any estimator, i.e., on the minimax risk. Let us now outline 
the details of the method. 

First, within this approach one assumes that the true dictionary $\mathbf{D}$ in (4) is taken uniformly at random (u.a.r.) from a 
finite subset $\mathcal{D}_0 = \{\mathbf{D}_l\}_{l \in [L]} \subseteq \mathcal{X}(\mathbf{D}_0, r)$ for some $L \in \mathbb{N}$ (cf. Fig. 1). This subset $\mathcal{D}_0$ is constructed such that (i) any 
two distinct dictionaries $\mathbf{D}_l, \mathbf{D}_{l'} \in \mathcal{D}_0$ are separated by at least $\sqrt{8\varepsilon}$, i.e., $\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F \geq \sqrt{8\varepsilon}$, and (ii) it is hard to detect 
the true dictionary $\mathbf{D}$, drawn u.a.r. out of $\mathcal{D}_0$, based on observing $\mathbf{Y}$. The existence of such a set $\mathcal{D}_0$ yields a relation 
between the sample size N and the remaining model parameters, i.e., m, p, s, $\sigma$, which has to be satisfied such that at 
least one estimator with minimax risk not exceeding $\varepsilon$ may exist. 

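In order to find a lower bound $\varepsilon^* \geq \varepsilon$ on the minimax risk $\varepsilon^*$ (cf. (8)), we hypothesize the existence of an estimator 
$\widehat{\mathbf{D}}(\mathbf{Y})$ achieving the minimax risk in (8). Then, the minimum distance detector 

$$\arg\min_{\mathbf{D}' \in \mathcal{D}_0} \|\widehat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}'\|_F \tag{9}$$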

recovers the correct dictionary $\mathbf{D} \in \mathcal{D}_0$ if $\widehat{\mathbf{D}}(\mathbf{Y})$ belongs to the open ball $\mathcal{B}(\mathbf{D}, \sqrt{2\varepsilon})$ (indicated by the dashed circles 
in Fig. 1) centered at $\mathbf{D}$ and with radius $\sqrt{2\varepsilon}$. The information-theoretic method [30], [?], [43] of lower bounding the 
minimax risk $\varepsilon^*$ consists then in relating, via Fano’s inequality [44, Ch. 2], the error probability $P\{\widehat{\mathbf{D}}(\mathbf{Y}) \notin \mathcal{B}(\mathbf{D}, \sqrt{2\varepsilon})\}$ 

Fig. 1. A finite ensemble $\mathcal{D}_0 = \{\mathbf{D}_l\}_{l \in [L]}$ containing L = 4 dictionaries used for deriving a lower bound $\varepsilon^* \geq \varepsilon$ on the minimax risk $\varepsilon^*$ (cf. (8)). 
For the true dictionary $\mathbf{D} = \mathbf{D}_1$, we also depicted a typical realization of an estimator $\widehat{\mathbf{D}}$ achieving the minimax risk. 

Fig. 2. Information-theoretic method for lower bounding the minimax risk. 


to the mutual information (MI) between the observation $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$ and the dictionary $\mathbf{D}$ in (4), which is assumed 
to be drawn u.a.r. out of $\mathcal{D}_0$. 
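
The minimum distance detector (9) is straightforward to write down. The following Python sketch uses a hypothetical toy ensemble of four 2 × 2 matrices (not unit-norm dictionaries) whose pairwise Frobenius distance is $2\sqrt{2}$, and it illustrates the geometric fact behind the reduction: whenever the estimate lies within half the minimum separation of the true element, the detector recovers it.

```python
import numpy as np

def minimum_distance_detector(D_hat, candidates):
    """Return the index of the candidate closest to D_hat in Frobenius norm."""
    distances = [np.linalg.norm(D_hat - D_l, "fro") for D_l in candidates]
    return int(np.argmin(distances))

# Hypothetical toy ensemble: L = 4 matrices, each with a single entry equal to 2,
# so any two of them are 2 * sqrt(2) apart in Frobenius norm.
candidates = [np.zeros((2, 2)) for _ in range(4)]
for l, D_l in enumerate(candidates):
    D_l[l // 2, l % 2] = 2.0

true_index = 2
perturbation = np.full((2, 2), 0.5)        # Frobenius norm 1.0 < half the separation
D_hat = candidates[true_index] + perturbation
detected = minimum_distance_detector(D_hat, candidates)
```

An estimator with small worst case MSE therefore yields a detector with small error probability, which is the quantity that Fano's inequality controls.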

Thus, within this approach, the estimation problem of DL is interpreted as a communication problem as illustrated in 
Fig. 2. The source selects the true dictionary $\mathbf{D} = \mathbf{D}_l$ by drawing u.a.r. an element $\mathbf{D}_l$ from the set $\mathcal{D}_0$. This element $\mathbf{D}_l$ 
then generates the “channel output” $\mathbf{Y} = (\mathbf{y}_1, \ldots, \mathbf{y}_N)$ via the model (4) for N channel uses. The observation model (4) 
acts as a channel model, relating the input $\mathbf{D} = \mathbf{D}_l$ to the output $\mathbf{Y}$. A crucial step in the information-theoretic approach 
is the analysis of the MI defined by [44] 

$$I(\mathbf{Y}; l) \triangleq E_{\mathbf{Y}, l}\left\{ \log \frac{p(\mathbf{Y}, l)}{p(\mathbf{Y})\, p(l)} \right\},$$

where $p(\mathbf{Y}, l)$, $p(\mathbf{Y})$ and $p(l)$ denote the joint and marginal distributions, respectively, of the channel output $\mathbf{Y}$ and the 
random index $l$. As it turns out, a key challenge for applying this method to DL is that the model (4) does not correspond 
to a simple AWGN channel, for which the MI between output and input can be characterized easily. Indeed, the model 
(4) corresponds to a fading channel with the vector $\mathbf{x}$ representing fading coefficients. As is known from the analysis of 
non-coherent channel capacity, characterizing the MI between output and input for fading channels is much more involved 
than for AWGN channels [45]. In particular, we require a tight upper bound on the MI $I(\mathbf{Y}; l)$ between the output $\mathbf{Y}$ 
and a random index $l$ which selects the input $\mathbf{D} = \mathbf{D}_l$ u.a.r. from a finite set $\mathcal{D}_0 \subseteq \mathcal{X}(\mathbf{D}_0, r)$. Upper bounding $I(\mathbf{Y}; l)$ 
typically involves the analysis of the Kullback-Leibler (KL) divergence between the distributions of $\mathbf{Y}$ induced by different 
dictionaries $\mathbf{D} = \mathbf{D}_l$, $l \in [L]$. 

Unfortunately, an exact characterization of the KL divergence between Gaussian mixture models is in general not 
possible and one has to resort to approximations or bounds [46]. A main conceptual contribution of this work is a strategy 
to avoid evaluating KL divergences between Gaussian mixture models. Instead, similar to the approach of [?], we assume 
that, in addition to the observation $\mathbf{Y}$, we also have access to some side information $T(\mathbf{X})$, which depends only on the 
coefficient vectors $\mathbf{x}_k$, for $k \in [N]$, stored column-wise in the matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_N)$. Clearly, any lower bound on the 
minimax risk for the situation with the additional side information $T(\mathbf{X})$ is trivially also a lower bound for the case of no 
side information, since the optimal learning scheme for the latter situation may simply ignore the side information $T(\mathbf{X})$. 
As we will show rigorously in Appendix A, we have the upper bound $I(\mathbf{Y}; l) \leq I(\mathbf{Y}; l \mid T(\mathbf{X}))$, where $I(\mathbf{Y}; l \mid T(\mathbf{X}))$ 
is the conditional mutual information, given the side information $T(\mathbf{X})$, between the observed data matrix $\mathbf{Y}$ and the 
random index $l$. Thus, in order to control the MI $I(\mathbf{Y}; l)$ it is sufficient to control the conditional MI $I(\mathbf{Y}; l \mid T(\mathbf{X}))$, which 
turns out to be a much easier task. We will use two specific choices for $T(\mathbf{X})$: $T(\mathbf{X}) = \mathbf{X}$ and $T(\mathbf{X}) = \mathrm{supp}(\mathbf{X})$. The 
choice $T(\mathbf{X}) = \mathbf{X}$ will yield tighter bounds for the case of high SNR, while the choice $T(\mathbf{X}) = \mathrm{supp}(\mathbf{X})$ yields more 
accurate bounds in the low SNR regime. As detailed in Section IV, the problem of upper bounding $I(\mathbf{Y}; l \mid T(\mathbf{X}))$ becomes 
tractable for both choices. 
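
The role of the MI in this argument can be illustrated with the textbook form of Fano's inequality for an index drawn uniformly from L hypotheses, $P_e \geq 1 - (I + \log 2)/\log L$ with the MI measured in nats. The Python sketch below evaluates this standard bound; the function name is hypothetical, and the constants used in the proofs of this paper may differ.

```python
import math

def fano_error_lower_bound(mutual_information, L):
    """Lower bound on the error probability of any detector of an index
    drawn uniformly from L hypotheses (mutual information in nats)."""
    if L < 2:
        raise ValueError("need at least two hypotheses")
    bound = 1.0 - (mutual_information + math.log(2)) / math.log(L)
    return max(bound, 0.0)

no_information = fano_error_lower_bound(0.0, L=16)
```

An upper bound on the MI, such as the one obtained via the conditional MI given side information, therefore translates into a lower bound on the detection error, and a large ensemble size L makes this bound more demanding.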


III. Lower Bounds on the Minimax Risk For DL 

We now state our main results, i.e., lower bounds on the minimax risk of DL. The first bound applies to any distribution 
of the coefficient vector $\mathbf{x}$, requiring only the existence of the covariance matrix $\boldsymbol{\Sigma}_x$. Two further, more specialized, lower 
bounds apply to sparse coefficient vectors and moreover require the underlying dictionary $\mathbf{D}$ in (1) to satisfy a restricted 
isometry property (RIP) [47]. 


A. General Coefficients 

In this section, we consider the DL problem based on the model (4) with a zero-mean random coefficient vector $\mathbf{x}$. We 
make no further assumptions on the statistics of $\mathbf{x}$ except that the covariance matrix $\boldsymbol{\Sigma}_x$ exists. For this setup, the side 
information $T(\mathbf{X})$ for the derivation of lower bounds on the minimax risk will be chosen as the coefficients themselves, i.e., 
$T(\mathbf{X}) = \mathbf{X}$. Our first main result is the following lower bound on the minimax risk for the DL problem. 

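Theorem III.1. Consider a DL problem based on N i.i.d. observations following the model (4) and with true dictionary 
satisfying (7) for some $r \leq 2\sqrt{p}$. Then, if 

$$p(m-1) \geq 50, \tag{10}$$

the minimax risk $\varepsilon^*$ is lower bounded as 

$$\varepsilon^* \geq \frac{1}{320} \min\left\{ r^2, \frac{\sigma^2}{N \|\boldsymbol{\Sigma}_x\|_2} \left( \frac{p(m-1)}{10} - 1 \right) \right\}. \tag{11}$$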

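The first bound in (11), i.e., $\varepsilon^* \geq r^2/320$, complies (up to fixed constants) with the worst case MSE of a dumb 
estimator $\widehat{\mathbf{D}}$ which ignores the observation $\mathbf{Y}$ and always delivers a fixed dictionary $\mathbf{D}_1 \in \mathcal{X}(\mathbf{D}_0, r)$. Since the true 
dictionary $\mathbf{D}$ also belongs to the neighborhood $\mathcal{X}(\mathbf{D}_0, r)$, the MSE of this estimator is upper bounded by 

$$\|\widehat{\mathbf{D}} - \mathbf{D}\|_F^2 = \|\mathbf{D}_1 - \mathbf{D}\|_F^2 \leq \left( \|\mathbf{D}_1 - \mathbf{D}_0\|_F + \|\mathbf{D}_0 - \mathbf{D}\|_F \right)^2 \leq 4r^2.$$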

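The second bound in (11) (ignoring constants) is essentially the minimax risk $\varepsilon'$ of a simple signal in noise problem 

$$\mathbf{z} = \mathbf{s} + \mathbf{n} \tag{12}$$

with AWGN $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\|\boldsymbol{\Sigma}_x\|_2)\, \mathbf{I}_{p(m-1)})$ and the unknown non-random signal $\mathbf{s}$ of dimension $p(m-1)$, which is also the 
dimension of the oblique manifold $\mathcal{D}$ [40]. A standard result in classical estimation theory is that, given the observation 
of N i.i.d. realizations of the vector $\mathbf{z}$ in (12), the minimax risk $\varepsilon'$ of estimating $\mathbf{s} \in \mathbb{R}^{p(m-1)}$ is [42, Exercise 5.8 on 
pp. 403] 

$$\varepsilon' = \frac{\sigma^2}{N \|\boldsymbol{\Sigma}_x\|_2}\, p(m-1). \tag{13}$$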

⁴The constant 1/40 is an artifact of our proof technique and might be improved by a more pedantic analysis. 

For fixed ratio $\|\boldsymbol{\Sigma}_x\|_2/\sigma^2$, the bound (11) predicts that $N = \Theta(pm)$ samples are required for accurate DL. Remarkably, 
this scaling matches the scaling of the sample size found in [35] to be sufficient for successful DL. Note, however, that 
the analysis in [35] is based on the sparse representation error of a dictionary, whereas we target the Frobenius norm of the 
deviation from the true underlying dictionary. 
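
For concrete problem sizes, the lower bound of Theorem III.1 can be evaluated directly. The Python sketch below assumes that bound (11) has the form $(1/320)\min\{r^2, (\sigma^2/(N\|\boldsymbol{\Sigma}_x\|_2))(p(m-1)/10 - 1)\}$ under condition (10); the function name and the numerical values are hypothetical.

```python
def theorem_iii1_lower_bound(N, m, p, r, sigma2, sigma_x_norm):
    """Evaluate the right-hand side of (11); requires p * (m - 1) >= 50."""
    if p * (m - 1) < 50:
        raise ValueError("condition (10) violated: need p*(m-1) >= 50")
    noise_term = sigma2 / (N * sigma_x_norm) * (p * (m - 1) / 10.0 - 1.0)
    return min(r ** 2, noise_term) / 320.0

m, p, r = 20, 40, 1.0
bound_small_N = theorem_iii1_lower_bound(N=10, m=m, p=p, r=r, sigma2=1.0, sigma_x_norm=1.0)
bound_N = theorem_iii1_lower_bound(N=1000, m=m, p=p, r=r, sigma2=1.0, sigma_x_norm=1.0)
bound_2N = theorem_iii1_lower_bound(N=2000, m=m, p=p, r=r, sigma2=1.0, sigma_x_norm=1.0)
```

For small N the radius term is active and the bound does not depend on the sample size; once the second term takes over, doubling N halves the bound.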


B. Sparse Coefficients 

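In this section we focus on a particular subclass of probability distributions for the zero mean coefficient vector $\mathbf{x}$ in 
(4). More specifically, the random support $\mathrm{supp}(\mathbf{x})$ of the coefficient vector $\mathbf{x}$ is assumed to be distributed uniformly over 
the set $\Xi \triangleq \{\mathcal{S} \subseteq [p] : |\mathcal{S}| = s\}$, i.e., 

$$P(\mathrm{supp}(\mathbf{x}) = \mathcal{S}) = \frac{1}{|\Xi|} = \frac{1}{\binom{p}{s}}, \text{ for any } \mathcal{S} \in \Xi. \tag{14}$$

We also assume that, conditioned on the support $\mathcal{S} = \mathrm{supp}(\mathbf{x})$, the non-zero entries of $\mathbf{x}$ are i.i.d. with variance $\sigma_a^2$, i.e., 
in particular 

$$E_{\mathbf{x}}\{\mathbf{x}_{\mathcal{S}} \mathbf{x}_{\mathcal{S}}^T \mid \mathcal{S}\} = \sigma_a^2 \mathbf{I}_s. \tag{15}$$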

The sparse coefficient support model (14) is useful for performing sparse coding of the observed samples $\mathbf{y}_k$. Indeed, 
once we have learned the dictionary $\mathbf{D}$, we can estimate for each observed sample $\mathbf{y}_k$, using a standard CS recovery 
method, the sparse coefficient vector $\mathbf{x}_k$. Sparse source coding is then accomplished by using the sparse coefficient vector 
to represent the signal $\mathbf{y}_k$. For sparse source coding to be robust against noise, one has to require the underlying dictionary 
$\mathbf{D}$ to be well conditioned for sparse signals. While there are various ways of quantifying the conditioning of a dictionary, 
e.g., based on the dictionary coherence [?], [?], we will focus here on the restricted isometry property (RIP) [?], [33], 
[?]. A dictionary $\mathbf{D}$ is said to satisfy the RIP of order s with constant $\delta_s$ if 

$$(1 - \delta_s)\|\mathbf{z}\|_2^2 \leq \|\mathbf{D}\mathbf{z}\|_2^2 \leq (1 + \delta_s)\|\mathbf{z}\|_2^2, \text{ for any } \mathbf{z} \in \mathbb{R}^p \text{ such that } \|\mathbf{z}\|_0 \leq s. \tag{16}$$

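Let us formally define the signal-to-noise ratio (SNR) for the observation model (4) as 

$$\mathrm{SNR} \triangleq E_{\mathbf{x}}\{\|\mathbf{D}\mathbf{x}\|_2^2\} / E_{\mathbf{n}}\{\|\mathbf{n}\|_2^2\}. \tag{17}$$

Note that the SNR depends on the unknown underlying dictionary $\mathbf{D}$. However, if $\mathbf{D}$ satisfies the RIP (16) with constant 
$\delta_s$, then we obtain the characterization 

$$(1 - \delta_s) \frac{s \sigma_a^2}{m \sigma^2} \leq \mathrm{SNR} \leq (1 + \delta_s) \frac{s \sigma_a^2}{m \sigma^2}, \tag{18}$$

which depends on $\mathbf{D}$ only via the RIP constant $\delta_s$. For a small constant $\delta_s$, (18) justifies the approximation $\mathrm{SNR} \approx s\sigma_a^2/(m\sigma^2)$. 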
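
The SNR (17) can be computed in closed form once the coefficient covariance is known, since $E\{\|\mathbf{D}\mathbf{x}\|_2^2\} = \mathrm{Tr}\{\mathbf{D}\boldsymbol{\Sigma}_x\mathbf{D}^T\}$ for zero-mean $\mathbf{x}$ and $E\{\|\mathbf{n}\|_2^2\} = m\sigma^2$ for white noise. The Python sketch below evaluates this expression for a randomly drawn dictionary with unit-norm columns and, as an illustrative assumption, an isotropic covariance $(s/p)\sigma_a^2 \mathbf{I}_p$; the function name and all numbers are hypothetical.

```python
import numpy as np

def snr(D, Sigma_x, sigma2):
    """SNR (17) for zero-mean x with covariance Sigma_x and white noise of variance sigma2."""
    m = D.shape[0]
    signal_power = np.trace(D @ Sigma_x @ D.T)   # E||Dx||^2 = Tr(D Sigma_x D^T)
    noise_power = m * sigma2                     # E||n||^2 = m * sigma^2
    return float(signal_power / noise_power)

rng = np.random.default_rng(0)
m, p, s = 8, 16, 3
sigma_a2, sigma2 = 2.0, 0.5
D = rng.standard_normal((m, p))
D /= np.linalg.norm(D, axis=0, keepdims=True)    # unit-norm columns
Sigma_x = (s / p) * sigma_a2 * np.eye(p)         # isotropic covariance, illustrative
value = snr(D, Sigma_x, sigma2)
```

With unit-norm columns and an isotropic covariance the result equals $s\sigma_a^2/(m\sigma^2)$ exactly, which lies inside the interval given by (18) for any RIP constant.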

As can be verified easily, any random coefficient vector $\mathbf{x}$ conforming with (14) and (15) possesses a finite covariance 
matrix, given explicitly by 

$$\boldsymbol{\Sigma}_x = (s/p)\, \sigma_a^2\, \mathbf{I}_p. \tag{19}$$
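
The covariance formula (19) is easy to confirm empirically. The Python sketch below draws coefficient vectors with a uniformly random support of size s and compares the empirical second-moment matrix with $(s/p)\sigma_a^2 \mathbf{I}_p$. The Gaussian distribution of the non-zero entries is an assumption made for the simulation only, since (15) merely fixes their variance; the function name is hypothetical.

```python
import numpy as np

def sample_sparse_coefficients(p, s, sigma_a, num_samples, seed=0):
    """Rows are coefficient vectors with a uniformly random size-s support
    and i.i.d. N(0, sigma_a^2) non-zero entries (Gaussian choice is an assumption)."""
    rng = np.random.default_rng(seed)
    # The first s indices of a random ordering form a uniformly random size-s support.
    supports = np.argsort(rng.random((num_samples, p)), axis=1)[:, :s]
    X = np.zeros((num_samples, p))
    values = sigma_a * rng.standard_normal((num_samples, s))
    np.put_along_axis(X, supports, values, axis=1)
    return X

p, s, sigma_a = 8, 2, 1.5
X = sample_sparse_coefficients(p, s, sigma_a, num_samples=200_000)
empirical_cov = X.T @ X / X.shape[0]
predicted_cov = (s / p) * sigma_a ** 2 * np.eye(p)
```

Each coordinate is active with probability s/p, which explains the factor s/p on the diagonal, while the zero mean of independent non-zero entries makes the off-diagonal entries vanish.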


Therefore we can invoke Theorem III.1 which, combined with (19) and (18), yields the following corollary. 

Corollary III.2. Consider a DL problem based on N i.i.d. observations according to the model (4) and with true dictionary 
satisfying (7) for some $r \leq 2\sqrt{p}$. Furthermore, the random coefficient vector $\mathbf{x}$ in (4) conforms with (14) and (15). If the 
dictionary $\mathbf{D}$ satisfies the RIP (16) with RIP-constant $\delta_s \leq 1/2$ and moreover 

$$p(m-1) \geq 50, \tag{20}$$

then the minimax risk $\varepsilon^*$ is lower bounded as 

$$\varepsilon^* \geq \frac{1}{320} \min\left\{ r^2, \frac{p}{2\, \mathrm{SNR}\, N m} \left( \frac{p(m-1)}{10} - 1 \right) \right\}. \tag{21}$$

For sufficiently large sample size N the second bound in (21) will be in force, and we obtain a scaling of the minimax 
risk as $\varepsilon^* = \Theta(p^2/(N\, \mathrm{SNR}))$. In particular, this bound suggests a decay of the worst case MSE via 1/N. This agrees 
with empirical results in [20], indicating that the MSE of popular DL methods typically decays with 1/N. Moreover, the 
dependence on the sample size via 1/N is theoretically sound, since averaging the outcomes of a learning scheme over 
N independent observations reduces the estimator variance by a factor of 1/N. Note that, as long as the first bound in (21) is not in 
force, the overall lower bound (21) scales with 1/SNR, which agrees with the basic behavior of the upper bound derived 
in [20] on the distance of the closest local minimum of (3) to the true dictionary $\mathbf{D}$. 

If we consider a fixed SNR (cf. (17)), our lower bound predicts that for a vanishing minimax risk $\varepsilon^*$ the sample size $N$ has to scale as $N = \Theta(p^2)$. This scaling is considerably smaller than the sample size requirement $N = O(p^3 m)$, which [20] proved to be sufficient in the noisy and over-complete setting, such that minimizing (3) yields an accurate estimate of the true dictionary $\mathbf{D}$. However, for vanishing sparsity rate ($s/p \to 0$), the scaling $N = \Theta(p^2)$ matches the required sample size of the algorithms put forward in [7], [26], certifying that, for extremely sparse signals, they perform close to the information-theoretic optimum for fixed SNR.

We will now derive an alternative lower bound on the minimax risk for DL based on the sparse coefficient model (14) and (15) by additionally assuming the non-zero coefficients to be Gaussian. In particular, let us denote by $\mathbf{P}$ a random matrix which is drawn u.a.r. from the set of all permutation matrices of size $p \times p$. Furthermore, we denote by $\mathbf{z} \in \mathbb{R}^s$ a multivariate normal random vector with zero mean and covariance matrix $\sigma_a^2\mathbf{I}_s$. Based on the matrix $\mathbf{P}$ and vector $\mathbf{z}$, we generate the coefficient vector $\mathbf{x}$ as

$$\mathbf{x} = \mathbf{P}\,(\mathbf{z}^T, \mathbf{0}_{1\times(p-s)})^T \quad \text{with } \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \sigma_a^2\mathbf{I}_s). \tag{22}$$

Theorem III.3 below presents a lower bound on the minimax risk for the low SNR regime where $\mathrm{SNR} \le (1/(9\sqrt{80}))\,m/(2s)$.

Theorem III.3. Consider a DL problem based on the model (4) such that (7) holds with some $r \le 2\sqrt{p}$ and the underlying dictionary $\mathbf{D}$ satisfies the RIP of order $s$ with constant $\delta_s \le 1/2$ (cf. (16)). We assume the coefficients $\mathbf{x}$ in (4) to be distributed according to (22) with $\mathrm{SNR} \le (1/(9\sqrt{80}))\,m/(2s)$. Then, if

$$p(m-1) \ge 50, \tag{23}$$

the minimax risk $\varepsilon^*$ is lower bounded as

$$\varepsilon^* \ge (1/12960)\,\min\Big\{ r^2/s,\; \frac{p}{\mathrm{SNR}^2 m^2 N}\,\big(p(m-1)/10-1\big) \Big\}. \tag{24}$$

The main difference between the bounds (21) and (24) is their dependence on the SNR (17). While the bound (21), which applies to arbitrary coefficient statistics and does not exploit the sparse structure of the model (22), depends on the SNR via $1/\mathrm{SNR}$, the bound (24) shows a dependence via $1/\mathrm{SNR}^2$. Thus, in the low SNR regime where $\mathrm{SNR} \ll 1$, the bound (24) tends to be tighter, i.e., higher, than the bound (21).
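To make the comparison concrete, the following Python sketch evaluates both types of bounds numerically. The first function uses the general form obtained in (80), i.e., $(1/320)\min\{r^2, (\sigma^2/(N\|\mathbf{\Sigma}_x\|_2))((m-1)p/10-1)\}$, with the covariance norm $(s/p)\sigma_a^2$ from (19); the second implements (24). The function names and all parameter values are hypothetical, and the SNR is replaced by its nominal value $s\sigma_a^2/(m\sigma^2)$, which is an approximation.

```python
def bound_general(r, N, m, p, sigma2, cov_norm):
    """(1/320) * min{r^2, sigma^2 / (N * ||Sigma_x||_2) * ((m-1)p/10 - 1)}."""
    if (m - 1) * p < 50:
        raise ValueError("the bound requires (m-1)p >= 50")
    second = sigma2 / (N * cov_norm) * ((m - 1) * p / 10 - 1)
    return min(r ** 2, second) / 320


def bound_gaussian(r, N, m, p, s, snr):
    """(1/12960) * min{r^2/s, p / (SNR^2 m^2 N) * (p(m-1)/10 - 1)}, cf. (24)."""
    if (m - 1) * p < 50:
        raise ValueError("the bound requires (m-1)p >= 50")
    second = p / (snr ** 2 * m ** 2 * N) * (p * (m - 1) / 10 - 1)
    return min(r ** 2 / s, second) / 12960


# Hypothetical low-SNR setting (values chosen for illustration only).
m, p, s, N, r = 64, 128, 4, 10 ** 10, 1.0
sigma_a2, sigma2 = 1.0, 400.0
snr = s * sigma_a2 / (m * sigma2)            # nominal SNR approximation
general = bound_general(r, N, m, p, sigma2, cov_norm=(s / p) * sigma_a2)
gaussian = bound_gaussian(r, N, m, p, s, snr)
print(snr, general, gaussian)
```

With the nominal SNR, the ratio of the two $N$-dependent terms is $1/(m\,\mathrm{SNR})$, so under these assumptions the Gaussian bound overtakes the general one only once the SNR is small enough to compensate for the different leading constants.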

We now show that the dependence of the bound (24) on the SNR via $1/\mathrm{SNR}^2$ agrees with the basic behavior of the constrained Cramér-Rao bound (CCRB) [49]. Indeed, if we assume for simplicity that $p = s = 1$ and the true dictionary (which is now a vector) is $\mathbf{d} = \mathbf{e}_1$, we obtain for the CCRB [49, Thm. 1]

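$$\mathrm{E}_{\mathbf{Y}}\{(\hat{\mathbf{d}}(\mathbf{Y})-\mathbf{d})(\hat{\mathbf{d}}(\mathbf{Y})-\mathbf{d})^T\} \succeq \frac{1 + m\,\mathrm{SNR}}{\mathrm{SNR}^2\, m^2 N}\,(\mathbf{I}-\mathbf{e}_1\mathbf{e}_1^T) \tag{25}$$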

for any unbiased learning scheme $\hat{\mathbf{d}}(\mathbf{Y})$, i.e., which satisfies $\mathrm{E}_{\mathbf{Y}}\{\hat{\mathbf{d}}(\mathbf{Y})\} = \mathbf{d}$. Thus, in this simplified setting, the dependence of the minimax bound (24) on the SNR via $1/\mathrm{SNR}^2$ is also reflected by the CCRB.

Let us finally highlight that the bound in Theorem III.3 is derived by exploiting the (conditional) Gaussianity of the non-zero entries in the coefficient vector. By contrast, the bounds in Theorem III.1 and Corollary III.2 do not require the non-zero entries to be Gaussian.

Footnote to (25): Using the notation of [49], we obtained (25) from [49, Thm. 1] by using the matrix $\mathbf{U} = (\mathbf{e}_2,\ldots,\mathbf{e}_m)$, which forms an orthonormal basis for the null space of the gradient mapping $\mathbf{F}(\mathbf{d}) = \partial f(\mathbf{d})/\partial\mathbf{d}$ with the constraint function $f(\mathbf{d}) = \|\mathbf{d}\|_2^2 - 1$. Moreover, for evaluating [49, Thm. 1] we used the formula $J_{i,j} = (1/2)\,\mathrm{Tr}\big\{\mathbf{C}^{-1}(\mathbf{d})\,\tfrac{\partial\mathbf{C}(\mathbf{d})}{\partial d_i}\,\mathbf{C}^{-1}(\mathbf{d})\,\tfrac{\partial\mathbf{C}(\mathbf{d})}{\partial d_j}\big\}$ [50] for the elements of the Fisher information matrix, which applies for a Gaussian observation with zero mean and whose covariance matrix $\mathbf{C}(\mathbf{d})$ depends on the parameter vector $\mathbf{d}$.

C. A partial converse

Given the lower bounds on the minimax risk presented in Sections III-A and III-B, it is natural to ask whether these are sharp, i.e., whether there exist DL schemes whose worst case MSE comes close to the lower bounds. To this end, we consider a simple instance of the DL problem and analyze the MSE of a very basic DL scheme. As it turns out, in certain regimes, the worst case MSE of this simple DL approach essentially matches the lower bound (21).

Theorem III.4. Consider a DL problem based on $N$ i.i.d. observations according to the model (4) and with true dictionary satisfying (7) with $\mathbf{D}_0 = \mathbf{I}$ and some $r \le 2\sqrt{p}$. Furthermore, the random coefficient vector $\mathbf{x}$ in (4) conforms with (14) and (15). Moreover, the non-zero entries of $\mathbf{x}$ have magnitude equal to one, i.e., $\mathbf{x} \in \{-1,0,1\}^p$. If $r\sqrt{s} \le 1/10$ and $\sigma \le 0.4$, there exists a DL scheme whose MSE satisfies

$$\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\} \le 4(p^2/N)\big[(1-r)^2/\mathrm{SNR} + 1\big] + 2p\exp(-pN\,0.4^2/(2\sigma^2)), \tag{26}$$

for any $\mathbf{D} \in \mathcal{X}(\mathbf{D}_0, r)$.

The proof of Theorem III.4, to be found at the end of Section IV, will be based on a straightforward analysis of a simple DL method which is given by the following algorithm.

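Algorithm 1. Input: data matrix $\mathbf{Y} = (\mathbf{y}_1,\ldots,\mathbf{y}_N)$

Output: learned dictionary $\hat{\mathbf{D}}(\mathbf{Y})$

1) Compute an estimate $\hat{\mathbf{X}}$ of the coefficient matrix $\mathbf{X} = (\mathbf{x}_1,\ldots,\mathbf{x}_N)$ by simple element-wise thresholding, i.e.,

$$\hat{\mathbf{X}} = (\hat{\mathbf{x}}_1,\ldots,\hat{\mathbf{x}}_N), \text{ with } \hat{x}_{k,l} = \begin{cases} 1, & \text{if } y_{k,l} > 0.5 \\ 0, & \text{if } |y_{k,l}| \le 0.5 \\ -1, & \text{if } y_{k,l} < -0.5. \end{cases} \tag{27}$$

2) For each column index $l \in [p]$, define

$$\tilde{\mathbf{d}}_l \triangleq \frac{p}{Ns} \sum_{k\in[N]} \hat{x}_{k,l}\,\mathbf{y}_k. \tag{28}$$

3) Output

$$\hat{\mathbf{D}}(\mathbf{Y}) \triangleq (\hat{\mathbf{d}}_1,\ldots,\hat{\mathbf{d}}_p), \text{ with } \hat{\mathbf{d}}_l = \mathbf{P}_{\mathcal{B}(1)}\tilde{\mathbf{d}}_l. \tag{29}$$

Here, $\mathbf{P}_{\mathcal{B}(1)}\mathbf{d} \triangleq \arg\min_{\mathbf{d}'\in\mathcal{B}(1)} \|\mathbf{d}'-\mathbf{d}\|_2$ denotes the projection of the vector $\mathbf{d} \in \mathbb{R}^m$ on the closed unit ball $\mathcal{B}(1) \triangleq \{\mathbf{d}' \in \mathbb{R}^m : \|\mathbf{d}'\|_2 \le 1\}$.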


Note that the learned dictionary $\hat{\mathbf{D}}(\mathbf{Y})$ obtained by Algorithm 1 might not have unit-norm columns, so that it might not belong to the oblique manifold $\mathcal{D}$. While this is somewhat counter-intuitive, as the true dictionary $\mathbf{D}$ belongs to $\mathcal{D}$, this fact is not relevant for the derivation of upper bounds on the MSE incurred by $\hat{\mathbf{D}}(\mathbf{Y})$.
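The three steps of Algorithm 1 (thresholding, scaled averaging, projection onto the unit ball) can be sketched in plain Python as follows. This is a minimal illustration under the setting of Theorem III.4 ($\mathbf{D}_0 = \mathbf{I}$, hence $m = p$); the function names `threshold` and `learn_dictionary` and the toy data are hypothetical.

```python
import math


def threshold(value):
    """Element-wise thresholding rule used in step 1."""
    if value > 0.5:
        return 1
    if value < -0.5:
        return -1
    return 0


def learn_dictionary(Y, s):
    """Sketch of Algorithm 1. Y is a list of N observation vectors of length p.

    Returns the learned dictionary as a list of p columns.
    """
    N, p = len(Y), len(Y[0])
    X_hat = [[threshold(v) for v in y] for y in Y]            # step 1
    columns = []
    for l in range(p):
        d = [0.0] * p
        for x_hat, y in zip(X_hat, Y):                        # step 2: sum of x_hat[k][l] * y_k
            if x_hat[l] != 0:
                for t in range(p):
                    d[t] += x_hat[l] * y[t]
        d = [p / (N * s) * v for v in d]
        norm = math.sqrt(sum(v * v for v in d))
        if norm > 1.0:                                        # step 3: project onto the unit ball
            d = [v / norm for v in d]
        columns.append(d)
    return columns


# Noiseless toy data generated with D = I, p = 2, s = 1 (illustration only).
Y = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(learn_dictionary(Y, s=1))
```

In this noiseless toy case each support pattern occurs equally often, so the scaled average in step 2 reproduces the columns of the identity and the projection leaves them unchanged.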

According to Theorem III.4, in the low-SNR regime, i.e., where $\mathrm{SNR} = o(1)$, and for sufficiently small noise variance, such that $\sigma \le 0.4$ and

$$p\exp(-pN\,0.4^2/(2\sigma^2)) = o\big((p^2/N)(1-r)^2/\mathrm{SNR}\big), \tag{30}$$

the MSE of the DL scheme given by Algorithm 1 scales as

$$\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y})-\mathbf{D}\|_F^2\} = O\Big(\frac{p^2}{N\,\mathrm{SNR}}\Big). \tag{31}$$

We highlight that the scaling of the upper bound (31) essentially matches the scaling of the lower bound (21), certifying that the bound of Corollary III.2 is tight in certain regimes.

IV. Proof of the main results

Before stating the detailed proofs of Theorem III.1 and Theorem III.3, we present the key idea behind, and the main ingredients used for, their proofs. At their core, the proofs of Theorem III.1 and Theorem III.3 are based on the construction of a finite set $\mathcal{D}_0 \triangleq \{\mathbf{D}_1,\ldots,\mathbf{D}_L\} \subseteq \mathcal{D}$ (cf. (jhl) of $L$ distinct dictionaries having the following desiderata:

• For any two different dictionaries $\mathbf{D}_l, \mathbf{D}_{l'} \in \mathcal{D}_0$ ($l \ne l'$),

$$\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \ge 8\varepsilon. \tag{32}$$

• If the true dictionary in (4) is chosen as $\mathbf{D} = \mathbf{D}_l \in \mathcal{D}_0$, where $l$ is selected u.a.r. from $[L]$, then the conditional MI between $\mathbf{Y}$ and $l$, given the side information $\mathbf{T}(\mathbf{X})$, is bounded as

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \le \eta \tag{33}$$

with some small $\eta$.

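For the verification of the existence of such a set $\mathcal{D}_0$, we rely on the following result:

Lemma IV.1. For $P \in \mathbb{N}$ such that

$$\log(P)/d < (1-2/10)^2/4, \tag{34}$$

there exists a set $\mathcal{P} \triangleq \{\mathbf{b}_l\}_{l\in[P]}$ of $P$ distinct binary vectors $\mathbf{b}_l \in \{-1,1\}^d$ satisfying

$$\|\mathbf{b}_l - \mathbf{b}_{l'}\|_0 \ge d/10, \text{ for any two different indices } l, l' \in [P]. \tag{35}$$

Proof: We construct the set $\mathcal{P}$ sequentially by drawing i.i.d. realizations $\mathbf{b}_l$ from a standard Bernoulli vector $\mathbf{b} \in \{-1,1\}^d$. Consider two different indices $l, l' \in [P]$. Define the vector $\tilde{\mathbf{b}} \triangleq \mathbf{b}_l \odot \mathbf{b}_{l'}$ by element-wise multiplication and observe that

$$\|\mathbf{b}_l - \mathbf{b}_{l'}\|_0 = (1/2)\Big(d - \sum_{r\in[d]} \tilde{b}_r\Big). \tag{36}$$

Each one of the three vectors $\mathbf{b}_l, \mathbf{b}_{l'}, \tilde{\mathbf{b}} \in \{-1,1\}^d$ contains zero-mean i.i.d. Bernoulli variables. We have

$$\mathrm{P}\{\|\mathbf{b}_l - \mathbf{b}_{l'}\|_0 \le d/10\} \stackrel{(36)}{=} \mathrm{P}\Big\{\Big(d - \sum_{r\in[d]} \tilde{b}_r\Big)/2 \le d/10\Big\} = \mathrm{P}\Big\{\sum_{r\in[d]} \tilde{b}_r \ge d(1-2/10)\Big\}. \tag{37}$$

According to Lemma A.2,

$$\mathrm{P}\Big\{\sum_{r\in[d]} \tilde{b}_r \ge d(1-2/10)\Big\} \le \exp(-d(1-2/10)^2/2). \tag{38}$$

Taking a union bound over all $\binom{P}{2}$ pairs $l, l' \in [P]$, we have from (37) and (38) that the probability of $P$ i.i.d. draws $\{\mathbf{b}_l\}_{l\in[P]}$ violating (35) is upper bounded by

$$P_1 \le \exp(-d(1-2/10)^2/2 + 2\log P), \tag{39}$$

which is strictly lower than 1 if (34) is valid. Thus, there must exist at least one set $\mathcal{P} = \{\mathbf{b}_l\}_{l\in[P]}$ of cardinality $P$ whose elements satisfy (35). ■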
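The probabilistic argument above can be mirrored by a simple randomized search: draw $P$ sign vectors at random and redraw until all pairwise Hamming distances are at least $d/10$. The following Python sketch does this; the function names, the seed, and the retry limit `max_tries` are hypothetical choices for illustration, not part of the proof.

```python
import itertools
import math
import random


def hamming(a, b):
    """Number of positions in which the sign vectors a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def draw_packing(P, d, seed=0, max_tries=1000):
    """Draw P vectors from {-1, 1}^d with pairwise Hamming distance >= d/10."""
    if math.log(P) / d >= (1 - 2 / 10) ** 2 / 4:
        raise ValueError("condition (34) is violated")
    rng = random.Random(seed)
    for _ in range(max_tries):
        vectors = [tuple(rng.choice((-1, 1)) for _ in range(d)) for _ in range(P)]
        pairs = itertools.combinations(vectors, 2)
        if all(hamming(a, b) >= d / 10 for a, b in pairs):
            return vectors
    raise RuntimeError("no valid set found within max_tries draws")


vectors = draw_packing(P=16, d=50)
print(min(hamming(a, b) for a, b in itertools.combinations(vectors, 2)))
```

Under (34) the bound (39) on the failure probability of a single draw is below one, so in this example the loop is expected to terminate after very few draws.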

The following result gives a sufficient condition on the cardinality $L$ and threshold $\eta$ such that there exists at least one subset $\mathcal{D}_0 \subseteq \mathcal{D}$ of $L$ distinct dictionaries satisfying (32) and (33).

Lemma IV.2. Consider a DL problem based on the generative model (4) such that (7) holds with some $r \le 2\sqrt{p}$. If $(m-1)p \ge 50$, there exists a set $\mathcal{D}_0 \subseteq \mathcal{D}$ of cardinality $L = 2^{(m-1)p/5}$ such that (32) and (33) (for the side information $\mathbf{T}(\mathbf{X}) = \mathbf{X}$) are satisfied with

$$\eta = 320\,N\,\|\mathbf{\Sigma}_x\|_2\,\varepsilon/\sigma^2 \tag{40}$$

and

$$\varepsilon \le r^2/320. \tag{41}$$

Footnote (side information): Particular choices for $\mathbf{T}(\mathbf{X})$ are discussed at the end of Section II-C.


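Proof: According to Lemma IV.1, for $(m-1)p \ge 50$, there is a set of $L$ matrices $\mathbf{D}_{1,l} \in (1/\sqrt{4(m-1)p})\,\{-1,1\}^{(m-1)\times p}$, $l \in [L]$, with $L \ge 2^{(m-1)p/5}$ such that

$$\|\mathbf{D}_{1,l} - \mathbf{D}_{1,l'}\|_F^2 \ge 1/40 \quad \text{for } l \ne l'. \tag{42}$$

Since the matrices $\mathbf{D}_{1,l} \in \mathbb{R}^{(m-1)\times p}$, for $l \in [L]$, have entries with values in $(1/\sqrt{4(m-1)p})\,\{-1,1\}$, their columns all have norm equal to $1/\sqrt{4p}$.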

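Based on the matrices $\mathbf{D}_{1,l} \in \mathbb{R}^{(m-1)\times p}$, we now construct a modified set of matrices $\mathbf{D}_{2,l} \in \mathbb{R}^{m\times p}$, $l \in [L]$. Let $\mathbf{U}_j$ denote an arbitrary $m \times m$ unitary matrix satisfying

$$\mathbf{d}_{0,j} = \mathbf{U}_j\mathbf{e}_1. \tag{43}$$

Here, $\mathbf{d}_{0,j}$ denotes the $j$th column of $\mathbf{D}_0 \in \mathcal{D}$. Then, we define the matrix $\mathbf{D}_{2,l}$ column-wise, by constructing its $j$th column $\mathbf{d}_{2,l,j}$ as

$$\mathbf{d}_{2,l,j} = \mathbf{U}_j \begin{pmatrix} 0 \\ \mathbf{d}_{1,l,j} \end{pmatrix}, \tag{44}$$

where $\mathbf{d}_{1,l,j}$ is the $j$th column of the matrix $\mathbf{D}_{1,l}$. Note that, for any $l \in [L]$, the $j$th column $\mathbf{d}_{2,l,j}$ of $\mathbf{D}_{2,l}$ is orthogonal to the column $\mathbf{d}_{0,j}$ and has norm equal to $1/\sqrt{4p}$, i.e.,

$$\mathrm{diag}\{\mathbf{D}_0^T\mathbf{D}_{2,l}\} = \mathbf{0}, \text{ and } \mathrm{diag}\{\mathbf{D}_{2,l}^T\mathbf{D}_{2,l}\} = (1/(4p))\,\mathbf{1}, \text{ for any } l \in [L]. \tag{45}$$

Moreover, for two distinct indices $l, l' \in [L]$, we have

$$\|\mathbf{D}_{2,l} - \mathbf{D}_{2,l'}\|_F^2 \stackrel{(44)}{=} \|\mathbf{D}_{1,l} - \mathbf{D}_{1,l'}\|_F^2 \stackrel{(42)}{\ge} 1/40. \tag{46}$$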


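Consider the matrices $\mathbf{D}_l$, $l \in [L]$, given as

$$\mathbf{D}_l = \sqrt{1-\varepsilon'/(4p)}\,\mathbf{D}_0 + \sqrt{\varepsilon'}\,\mathbf{D}_{2,l}, \tag{47}$$

where

$$\varepsilon' \triangleq 320\,\varepsilon. \tag{48}$$

The construction (47) is feasible, since (41) guarantees $\varepsilon' \le r^2 \le 4p$. We will now verify that the matrices $\mathbf{D}_l$, for $l \in [L]$, belong to $\mathcal{X}(\mathbf{D}_0, r)$ and moreover are such that (32) and (33), with $\eta$ given in (40), are satisfied.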


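$\mathbf{D}_l$ belongs to $\mathcal{X}(\mathbf{D}_0, r)$: Consider the $j$th columns $\mathbf{d}_{l,j}$, $\mathbf{d}_{0,j}$ and $\mathbf{d}_{2,l,j}$ of $\mathbf{D}_l$, $\mathbf{D}_0$ and $\mathbf{D}_{2,l}$, respectively. Then

$$\|\mathbf{d}_{l,j}\|_2^2 \stackrel{(47),(45)}{=} (1-\varepsilon'/(4p))\,\|\mathbf{d}_{0,j}\|_2^2 + \varepsilon'\,\|\mathbf{d}_{2,l,j}\|_2^2 \stackrel{(45)}{=} (1-\varepsilon'/(4p)) + \varepsilon'/(4p) = 1. \tag{49}$$

Thus, the columns of any $\mathbf{D}_l$, for $l \in [L]$, have unit norm. Moreover,

$$\begin{aligned}
\|\mathbf{D}_l - \mathbf{D}_0\|_F^2 &\stackrel{(47)}{=} \big\|\big(1-\sqrt{1-\varepsilon'/(4p)}\big)\mathbf{D}_0 - \sqrt{\varepsilon'}\,\mathbf{D}_{2,l}\big\|_F^2 \\
&\stackrel{(45)}{=} \big(1-\sqrt{1-\varepsilon'/(4p)}\big)^2\|\mathbf{D}_0\|_F^2 + \varepsilon'\,\|\mathbf{D}_{2,l}\|_F^2 \\
&\stackrel{(45)}{=} \big(1-\sqrt{1-\varepsilon'/(4p)}\big)^2\|\mathbf{D}_0\|_F^2 + \varepsilon'/4 \\
&\le (\varepsilon'/(4p))^2\,p + \varepsilon'/4 \qquad (\text{since } \varepsilon'/(4p) \le 1 \text{ and } \mathbf{D}_0 \in \mathcal{D}) \\
&\le (1/2)\,\varepsilon' \qquad (\text{since } \varepsilon'/(4p) \le 1) \\
&\stackrel{(41)}{\le} r^2.
\end{aligned}$$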


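Lower bounding $\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2$: The squared distance between two different matrices $\mathbf{D}_l$ and $\mathbf{D}_{l'}$ is obtained as

$$\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \stackrel{(47)}{=} \varepsilon'\,\|\mathbf{D}_{2,l} - \mathbf{D}_{2,l'}\|_F^2 \stackrel{(46)}{\ge} \varepsilon'/40. \tag{50}$$

Thus, we have verified

$$\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \ge \varepsilon'/40 \stackrel{(48)}{=} 8\varepsilon, \tag{51}$$

for any two different $l, l' \in [L]$.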
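The properties just verified can also be checked numerically for a small instance of the construction (47). The Python sketch below assumes the special case $\mathbf{D}_0 = \mathbf{I}$ (so $m = p$) and chooses $\mathbf{U}_j$ as the permutation that swaps the first and the $j$th coordinate, which is one admissible choice satisfying (43). Function names, the seed, and the parameter values are hypothetical.

```python
import math
import random


def build_dictionary(signs, eps_prime):
    """Return D_l = sqrt(1 - eps'/(4p)) I + sqrt(eps') D_{2,l} as a list of rows.

    signs: (m-1) x p matrix with entries in {-1, +1}; this sketch needs m == p.
    """
    p, m = len(signs[0]), len(signs) + 1
    if m != p:
        raise ValueError("this sketch takes D_0 = I, hence m must equal p")
    if not 0.0 <= eps_prime <= 4 * p:
        raise ValueError("eps_prime must lie in [0, 4p]")
    scale = 1.0 / math.sqrt(4 * (m - 1) * p)        # entry magnitude of D_{1,l}
    a, b = math.sqrt(1.0 - eps_prime / (4 * p)), math.sqrt(eps_prime)
    D = [[0.0] * p for _ in range(m)]
    for j in range(p):
        col = [0.0] + [scale * signs[i][j] for i in range(m - 1)]   # (0; d_{1,l,j})
        col[0], col[j] = col[j], col[0]             # apply U_j (swap coordinates 1 and j)
        for i in range(m):
            D[i][j] = b * col[i] + (a if i == j else 0.0)
    return D


def frobenius_sq(A, B):
    """Squared Frobenius distance between two matrices given as lists of rows."""
    return sum((x - y) ** 2 for ra, rb in zip(A, B) for x, y in zip(ra, rb))


rng = random.Random(1)
p = m = 8                                            # (m-1)p = 56 >= 50
eps_prime = 0.5
signs = [[rng.choice((-1, 1)) for _ in range(p)] for _ in range(m - 1)]
D_l = build_dictionary(signs, eps_prime)
identity = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(m)]
column_norms = [math.sqrt(sum(D_l[i][j] ** 2 for i in range(m))) for j in range(p)]
print(column_norms, frobenius_sq(D_l, identity))
```

The printed column norms should all be one, in line with (49), and the squared distance to $\mathbf{D}_0$ should not exceed $\varepsilon'/2$.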

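Upper bounding $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$: We will now upper bound the conditional MI $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$, conditioned on the side information $\mathbf{T}(\mathbf{X}) = \mathbf{X}$, between the observation $\mathbf{Y}$ and the index $l$ of the true dictionary $\mathbf{D} = \mathbf{D}_l \in \mathcal{D}_0$ in (4). Here, the random index $l$ is taken u.a.r. from the set $[L]$. First, note that the dictionaries $\mathbf{D}_l$, given by (47), satisfy

$$\begin{aligned}
\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 &\stackrel{(47)}{=} \varepsilon'\,\|\mathbf{D}_{2,l} - \mathbf{D}_{2,l'}\|_F^2 \\
&\le \varepsilon'\,\big(\|\mathbf{D}_{2,l}\|_F + \|\mathbf{D}_{2,l'}\|_F\big)^2 \\
&= 4\,\varepsilon'\,\|\mathbf{D}_{2,l}\|_F^2 \\
&\stackrel{(45),(48)}{=} 320\,\varepsilon.
\end{aligned} \tag{52}$$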


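According to our observation model (4), conditioned on the coefficients $\mathbf{x}_k$, the observations $\mathbf{y}_k$ follow a multivariate Gaussian distribution with covariance matrix $\sigma^2\mathbf{I}$ and mean vector $\mathbf{D}\mathbf{x}_k$. Therefore, we can employ a standard argument based on the convexity of the Kullback-Leibler (KL) divergence (see, e.g., ETJ) to upper bound $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$ as

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \le \frac{1}{L^2} \sum_{l,l'\in[L]} \mathrm{E}_{\mathbf{X}}\big\{ D\big(f_{\mathbf{D}_l}(\mathbf{Y}|\mathbf{X}) \,\|\, f_{\mathbf{D}_{l'}}(\mathbf{Y}|\mathbf{X})\big) \big\}, \tag{53}$$

where $D(f_{\mathbf{D}_l}(\mathbf{Y}|\mathbf{X}) \,\|\, f_{\mathbf{D}_{l'}}(\mathbf{Y}|\mathbf{X}))$ denotes the KL divergence between the conditional probability density functions (given the coefficients $\mathbf{X} = (\mathbf{x}_1,\ldots,\mathbf{x}_N)$) of the observations $\mathbf{Y}$ for the true dictionary being either $\mathbf{D}_l$ or $\mathbf{D}_{l'}$. Since, given the coefficients $\mathbf{X}$, the observations $\mathbf{y}_k$ are independent multivariate Gaussian random vectors with mean $\mathbf{D}\mathbf{x}_k$ and the same covariance matrix $\sigma^2\mathbf{I}$, we can apply the formula m Eq. (3)] for the KL divergence to obtain

$$\begin{aligned}
D\big(f_{\mathbf{D}_l}(\mathbf{Y}|\mathbf{X}) \,\|\, f_{\mathbf{D}_{l'}}(\mathbf{Y}|\mathbf{X})\big) &= \sum_{k\in[N]} \frac{1}{2\sigma^2}\,\|(\mathbf{D}_l - \mathbf{D}_{l'})\,\mathbf{x}_k\|_2^2 \\
&= \sum_{k\in[N]} \frac{1}{2\sigma^2}\,\mathrm{Tr}\big\{(\mathbf{D}_l - \mathbf{D}_{l'})^T(\mathbf{D}_l - \mathbf{D}_{l'})\,\mathbf{x}_k\mathbf{x}_k^T\big\}.
\end{aligned} \tag{54}$$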

Inserting (54) into (53) and using (52) as well as

$$\mathrm{Tr}\{\mathbf{A}^T\mathbf{A}\,\mathbf{\Sigma}_x\} \le \|\mathbf{\Sigma}_x\|_2\,\|\mathbf{A}\|_F^2,$$

yields

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \le 320\,N\,\|\mathbf{\Sigma}_x\|_2\,\varepsilon/\sigma^2, \tag{55}$$

completing the proof. ■

For the proof of Theorem III.3 we will need a variation of Lemma IV.2, which is based on using the side information $\mathbf{T}(\mathbf{X}) = \mathrm{supp}(\mathbf{X})$ instead of $\mathbf{X}$ itself.

Lemma IV.3. Consider a DL problem based on the generative model (4) such that (7) holds with some $r \le 2\sqrt{p}$. The random sparse coefficients $\mathbf{x}$ are distributed according to (22) with $\mathrm{SNR} \le (1/(9\sqrt{80}))\,m/(2s)$. We assume that the reference dictionary $\mathbf{D}_0$ satisfies the RIP of order $s$ with constant $\delta_s \le 1/2$.

If $(m-1)p \ge 50$ then there exists a set $\mathcal{D}_0 \subseteq \mathcal{D}$ of cardinality $L = 2^{(m-1)p/5}$ such that (32) and (33), for the side information $\mathbf{T}(\mathbf{X}) = \mathrm{supp}(\mathbf{X})$, are satisfied with

$$\eta = 12960\,N\,\mathrm{SNR}^2 m^2\,\varepsilon/p, \tag{56}$$

and

$$\varepsilon \le r^2/(320\,s). \tag{57}$$

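Proof: We will use the same ensemble $\mathcal{D}_0$ (cf. (47)) as in the proof of Lemma IV.2 (note that condition (57) implies (41) since $s \ge 1$). Thus, we already verified in the proof of Lemma IV.2 that $\mathcal{D}_0 \subseteq \mathcal{X}(\mathbf{D}_0, r)$ and (32) is satisfied.

Upper bounding $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$: We will now upper bound the conditional MI $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$, conditioned on the side information $\mathbf{T}(\mathbf{X}) = \mathrm{supp}(\mathbf{X})$, between the observation $\mathbf{Y} = (\mathbf{y}_1,\ldots,\mathbf{y}_N)$ and the index $l$ of the true dictionary $\mathbf{D} = \mathbf{D}_l \in \mathcal{D}_0$ in (4). Here, the random index $l$ is taken u.a.r. from the set $[L]$ and the conditioning is w.r.t. the random supports $\mathrm{supp}(\mathbf{X}) \triangleq (\mathrm{supp}(\mathbf{x}_1),\ldots,\mathrm{supp}(\mathbf{x}_N))$ of the coefficient vectors $\mathbf{x}_k$, being i.i.d. realizations of the sparse vector $\mathbf{x}$ given by (22). Let us introduce for the following the shorthand $\mathcal{S}_k \triangleq \mathrm{supp}(\mathbf{x}_k)$.

Note that, conditioned on $\mathcal{S}_k$, the columns of the matrix $\mathbf{Y}$, i.e., the observed samples $\mathbf{y}_k$, are independent multivariate Gaussian random vectors with zero mean and covariance matrix

$$\mathbf{\Sigma}_k = \sigma_a^2\,\mathbf{D}_{\mathcal{S}_k}\mathbf{D}_{\mathcal{S}_k}^T + \sigma^2\mathbf{I}. \tag{58}$$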


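Thus, according to ll^ Eq. (18)], we can use the following bound on the conditional MI:

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \le \mathrm{E}_{\mathbf{T}(\mathbf{X})}\Big\{ \sum_{k\in[N]} \frac{1}{L^2} \sum_{l,l'\in[L]} \mathrm{Tr}\big\{ \big(\mathbf{\Sigma}_{k,l}^{-1} - \mathbf{\Sigma}_{k,l'}^{-1}\big)\big(\mathbf{\Sigma}_{k,l'} - \mathbf{\Sigma}_{k,l}\big) \big\} \Big\} \tag{59}$$

with

$$\mathbf{\Sigma}_{k,l} \triangleq \sigma_a^2\,\mathbf{D}_{l,\mathcal{S}_k}\mathbf{D}_{l,\mathcal{S}_k}^T + \sigma^2\mathbf{I}. \tag{60}$$

Here, $\mathrm{E}_{\mathbf{T}(\mathbf{X})}\{\cdot\}$ denotes expectation with respect to the side information $\mathbf{T}(\mathbf{X}) = (\mathcal{S}_1,\ldots,\mathcal{S}_N)$, which is distributed uniformly over the $N$-fold product of the set of all size-$s$ subsets of $[p]$ (cf. (14)). Since any of the matrices $\mathbf{\Sigma}_{k,l}$ is made up of the common component $\sigma^2\mathbf{I}$ and the individual component $\sigma_a^2\,\mathbf{D}_{l,\mathcal{S}_k}\mathbf{D}_{l,\mathcal{S}_k}^T$, which has rank not larger than $s$, for any two $l, l' \in [L]$, the difference $\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}$ satisfies

$$\mathrm{rank}\{\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\} \le 2s. \tag{61}$$

Therefore, using $\mathrm{Tr}\{\mathbf{A}\} \le \mathrm{rank}\{\mathbf{A}\}\,\|\mathbf{A}\|_2$ and (61), we can rewrite (59) as

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \le 2s\,\mathrm{E}_{\mathbf{T}(\mathbf{X})}\Big\{ \sum_{k\in[N]} \frac{1}{L^2} \sum_{l,l'\in[L]} \big\|\mathbf{\Sigma}_{k,l}^{-1} - \mathbf{\Sigma}_{k,l'}^{-1}\big\|_2\,\big\|\mathbf{\Sigma}_{k,l'} - \mathbf{\Sigma}_{k,l}\big\|_2 \Big\}. \tag{62}$$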


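In what follows, we will first upper bound the spectral norm $\|\mathbf{\Sigma}_{k,l'} - \mathbf{\Sigma}_{k,l}\|_2$ and subsequently, using a perturbation result [52] for matrix inversion, upper bound the spectral norm $\|\mathbf{\Sigma}_{k,l}^{-1} - \mathbf{\Sigma}_{k,l'}^{-1}\|_2$. Inserting these two bounds into (62) will then yield the final upper bound on $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$.

Due to the construction (47),

$$\begin{aligned}
\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'} &= \sigma_a^2\big(\mathbf{D}_{l,\mathcal{S}_k}\mathbf{D}_{l,\mathcal{S}_k}^T - \mathbf{D}_{l',\mathcal{S}_k}\mathbf{D}_{l',\mathcal{S}_k}^T\big) \\
&\stackrel{(47)}{=} \sigma_a^2\Big[\sqrt{1-\varepsilon'/(4p)}\,\sqrt{\varepsilon'}\;\overline{\mathbf{D}_{0,\mathcal{S}_k}\big(\mathbf{D}_{2,l,\mathcal{S}_k} - \mathbf{D}_{2,l',\mathcal{S}_k}\big)^T} + \varepsilon'\big(\mathbf{D}_{2,l,\mathcal{S}_k}\mathbf{D}_{2,l,\mathcal{S}_k}^T - \mathbf{D}_{2,l',\mathcal{S}_k}\mathbf{D}_{2,l',\mathcal{S}_k}^T\big)\Big]
\end{aligned} \tag{63}$$

with the shorthand $\overline{\mathbf{X}} \triangleq \mathbf{X} + \mathbf{X}^T$. In what follows, we need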

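$$\|\mathbf{D}_{0,\mathcal{S}}\|_2 \le \sqrt{3/2}, \quad \|\mathbf{D}_{2,l,\mathcal{S}}\|_2 \le \sqrt{s/(4p)}, \quad \text{and} \quad \|\mathbf{\Sigma}_{k,l}^{-1}\|_2 \le 1/\sigma^2, \tag{64}$$

for any $l \in [L]$ and any subset $\mathcal{S} \subseteq [p]$ with $|\mathcal{S}| \le s$. The first bound in (64) follows from the assumed RIP (with constant $\delta_s \le 1/2$) of the reference dictionary $\mathbf{D}_0$. The second bound in (64) is valid because the matrices $\mathbf{D}_{2,l}$ have columns with norm equal to $1/\sqrt{4p}$ (cf. (45)). For the verification of the last bound in (64) we note that, according to (60), $\lambda_{\min}(\mathbf{\Sigma}_{k,l}) \ge \sigma^2$. Therefore,

$$\begin{aligned}
\|\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\|_2 &\stackrel{(63),(64)}{\le} 2\sigma_a^2\sqrt{3/2}\,\sqrt{1-\varepsilon'/(4p)}\,\sqrt{\varepsilon' s/(4p)} + 2\sigma_a^2\,\varepsilon' s/(4p) \\
&\le 4.5\,\sigma_a^2\sqrt{\varepsilon' s/(4p)}.
\end{aligned} \tag{65}$$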


Since the true dictionary $\mathbf{D}$ is assumed to satisfy the RIP with constant $\delta_s \le 1/2$, the low SNR condition $\mathrm{SNR} \le (1/(9\sqrt{80}))\,m/(2s)$ implies via (18),

$$(\sigma_a/\sigma)^2 \le \frac{1}{9\sqrt{80}}. \tag{66}$$


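Since

$$\big\|\mathbf{\Sigma}_{k,l}^{-1}\big(\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\big)\big\|_2 \le \|\mathbf{\Sigma}_{k,l}^{-1}\|_2\,\|\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\|_2 \stackrel{(64),(65)}{\le} 4.5\,(\sigma_a/\sigma)^2\sqrt{\varepsilon' s/(4p)} \stackrel{(66),(57)}{\le} 1/2, \tag{67}$$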


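we can invoke [52, Theorem 2.3.4], yielding

$$\big\|\mathbf{\Sigma}_{k,l}^{-1} - \mathbf{\Sigma}_{k,l'}^{-1}\big\|_2 \le 2\,\|\mathbf{\Sigma}_{k,l}^{-1}\|_2^2\,\|\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\|_2 \stackrel{(64)}{\le} 2\sigma^{-4}\,\|\mathbf{\Sigma}_{k,l'} - \mathbf{\Sigma}_{k,l}\|_2. \tag{68}$$

Inserting (65) and (68) into (62) yields the bound

$$\begin{aligned}
I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) &\le 4s\,\sigma^{-4}\,\mathrm{E}_{\mathbf{T}(\mathbf{X})}\Big\{ \sum_{k\in[N]} \frac{1}{L^2} \sum_{l,l'\in[L]} \|\mathbf{\Sigma}_{k,l} - \mathbf{\Sigma}_{k,l'}\|_2^2 \Big\} \\
&\le 4\cdot 4.5^2\,N s^2\,(\sigma_a/\sigma)^4\,\varepsilon'/(4p) \\
&\le 6480\,N s^2\,(\sigma_a/\sigma)^4\,\varepsilon/p \\
&\le 12960\,N\,\mathrm{SNR}^2 m^2\,\varepsilon/p \qquad (\text{using } \delta_s \le 1/2 \text{ and (18)}),
\end{aligned} \tag{69}$$

completing the proof. ■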

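The next result relates the cardinality $L$ of a subset $\mathcal{D}_0 = \{\mathbf{D}_1,\ldots,\mathbf{D}_L\} \subseteq \mathcal{D}$ to the conditional MI $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$ between the observation $\mathbf{Y} = (\mathbf{y}_1,\ldots,\mathbf{y}_N)$, with $\mathbf{y}_k$ i.i.d. according to (4), and a random index $l$ selecting the true dictionary $\mathbf{D}$ in (4) u.a.r. from $\mathcal{D}_0$.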

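Lemma IV.4. Consider the DL problem (4) with minimax risk $\varepsilon^*$ (cf. ®), which is assumed to be upper bounded by a positive number $\varepsilon$, i.e., $\varepsilon^* \le \varepsilon$. Assume there exists a finite set $\mathcal{D}_0 = \{\mathbf{D}_1,\ldots,\mathbf{D}_L\} \subseteq \mathcal{X}(\mathbf{D}_0, r)$ consisting of $L$ distinct dictionaries $\mathbf{D}_l$ such that

$$\|\mathbf{D}_l - \mathbf{D}_{l'}\|_F^2 \ge 8\varepsilon \quad \text{for any two different } l, l' \in [L]. \tag{70}$$

Then, for any function $\mathbf{T}(\mathbf{X})$ of the true coefficients $\mathbf{X} = (\mathbf{x}_1,\ldots,\mathbf{x}_N)$,

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \ge (1/2)\log_2(L) - 1. \tag{71}$$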


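Proof: Our proof idea closely follows that of ll^ Thm. 1]. Consider a minimax estimator $\hat{\mathbf{D}}(\mathbf{Y})$, whose worst case MSE is equal to $\varepsilon^*$, i.e.,

$$\sup_{\mathbf{D}\in\mathcal{X}(\mathbf{D}_0,r)} \mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\} = \varepsilon^*, \tag{72}$$

and, in turn, since $\mathcal{D}_0 \subseteq \mathcal{X}(\mathbf{D}_0, r)$,

$$\sup_{\mathbf{D}\in\mathcal{D}_0} \mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\} \le \varepsilon^*. \tag{73}$$

Based on the estimator $\hat{\mathbf{D}}(\mathbf{Y})$, we define a detector $\hat{l}(\mathbf{Y})$ for the index of the true underlying dictionary $\mathbf{D}_l \in \mathcal{D}_0$ via

$$\hat{l}(\mathbf{Y}) \triangleq \arg\min_{l'\in[L]} \|\mathbf{D}_{l'} - \hat{\mathbf{D}}(\mathbf{Y})\|_F^2. \tag{74}$$

In case of ties, i.e., when there are multiple indices $l'$ such that $\mathbf{D}_{l'}$ achieves the minimum in (74), we randomly select one of the minimizing indices as the estimate $\hat{l}(\mathbf{Y})$. Let us now assume that the index $l$ is selected u.a.r. from $[L]$ and bound the probability $P_e$ of a detection error, i.e., $P_e \triangleq \mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\}$. Note that if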

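$$\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F^2 < 2\varepsilon \tag{75}$$

then for any wrong index $l' \in [L] \setminus \{l\}$,

$$\begin{aligned}
\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_{l'}\|_F &= \|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l + \mathbf{D}_l - \mathbf{D}_{l'}\|_F \\
&\ge \|\mathbf{D}_l - \mathbf{D}_{l'}\|_F - \|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F \\
&\stackrel{(70),(75)}{>} (\sqrt{8} - \sqrt{2})\sqrt{\varepsilon} \\
&= \sqrt{2\varepsilon} \\
&> \|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F.
\end{aligned} \tag{76}$$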

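Thus, the condition (75) guarantees that the detector $\hat{l}(\mathbf{Y})$ in (74) delivers the correct index $l$. Therefore, in turn, a detection error can only occur if $\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F^2 \ge 2\varepsilon$, implying that

$$\begin{aligned}
P_e &\le \mathrm{P}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F^2 \ge 2\varepsilon\} \\
&\stackrel{(a)}{\le} \frac{1}{2\varepsilon}\,\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}_l\|_F^2\} \\
&\stackrel{(73)}{\le} \frac{\varepsilon^*}{2\varepsilon} \\
&\le 1/2,
\end{aligned} \tag{77}$$

where (a) is due to the Markov inequality [53] and the last step uses $\varepsilon^* \le \varepsilon$. However, according to Lemma A.1 we also have

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \ge \log_2(L) - P_e\log_2(L) - 1, \tag{78}$$

and, in turn, since $P_e \le 1/2$ by (77),

$$I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \ge (1/2)\log_2(L) - 1,$$

completing the proof. ■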

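Finally, we simply have to put the pieces together to obtain Theorem III.1 and Theorem III.3.

Proof of Theorem III.1: According to Lemma IV.2, if $(m-1)p \ge 50$ and for any $\varepsilon \le r^2/320$ (this condition is implied by the first bound in Theorem III.1), there exists a set $\mathcal{D}_0 \subseteq \mathcal{X}(\mathbf{D}_0, r)$ of cardinality $L = 2^{(m-1)p/5}$ satisfying (32) and (33) with $\eta = 320\,N\,\|\mathbf{\Sigma}_x\|_2\,\varepsilon/\sigma^2$. Applying Lemma IV.4 to the set $\mathcal{D}_0$ yields, in turn,

$$320\,N\,\|\mathbf{\Sigma}_x\|_2\,\varepsilon/\sigma^2 \ge I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \ge (1/2)\log_2(L) - 1 \tag{79}$$

implying

$$\varepsilon \ge \frac{\sigma^2}{320\,N\,\|\mathbf{\Sigma}_x\|_2}\big((1/2)\log_2(L) - 1\big) \ge \frac{\sigma^2}{320\,N\,\|\mathbf{\Sigma}_x\|_2}\big((m-1)p/10 - 1\big). \tag{80}$$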


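Proof of Theorem III.3: According to Lemma IV.3, if $(m-1)p \ge 50$ and for any $\varepsilon \le r^2/(320\,s)$ (this condition is implied by the first bound in (24)), there exists a set $\mathcal{D}_0 \subseteq \mathcal{X}(\mathbf{D}_0, r)$ of cardinality $L = 2^{(m-1)p/5}$ satisfying (32) and (33) with $\eta = 12960\,N m^2\,\mathrm{SNR}^2\,\varepsilon/p$. Applying Lemma IV.4 to the set $\mathcal{D}_0$ yields, in turn,

$$12960\,N m^2\,\mathrm{SNR}^2\,\varepsilon/p \ge I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) \ge (1/2)\log_2(L) - 1 \tag{81}$$

implying

$$\varepsilon \ge \frac{\mathrm{SNR}^{-2}\,p}{12960\,N m^2}\big((1/2)\log_2(L) - 1\big) \ge \frac{\mathrm{SNR}^{-2}\,p}{12960\,N m^2}\big((m-1)p/10 - 1\big). \tag{82}$$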


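Proof of Theorem III.4: First note that any dictionary $\mathbf{D} \in \mathcal{X}(\mathbf{D}_0 = \mathbf{I}, r)$ can be written as

$$\mathbf{D} = \mathbf{I} + \mathbf{\Delta}, \text{ with } \|\mathbf{\Delta}\|_F \le r. \tag{83}$$

Any matrix $\mathbf{D}$ of the form (83) satisfies the RIP with constant $\delta_s$ such that

$$(1-r)^2 \le 1-\delta_s \le 1+\delta_s \le (1+r)^2. \tag{84}$$

Moreover, since we assume the coefficient vectors $\mathbf{x}_k$ in (4) to be discrete-valued, $\mathbf{x}_k \in \{-1,0,1\}^p$, and complying with (14),

$$\mathrm{E}_{\mathbf{x}}\{x_{k,l}^2\} = s/p, \tag{85}$$

and

$$\|\mathbf{x}_k\|_2^2 = s. \tag{86}$$

For (86), we used the fact that the non-zero entries of $\mathbf{x}_k$ all have the same magnitude equal to one. Combining (86) with (84), we obtain the following bound on the SNR:

$$\mathrm{SNR} = \mathrm{E}_{\mathbf{x}}\{\|\mathbf{D}\mathbf{x}\|_2^2\}/\mathrm{E}_{\mathbf{n}}\{\|\mathbf{n}\|_2^2\} \stackrel{(86)}{\ge} (1-\delta_s)\,s/(m\sigma^2) \stackrel{(84)}{\ge} (1-r)^2 s/(m\sigma^2). \tag{87}$$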


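In order to derive an upper bound on the MSE of the DL scheme given by Algorithm 1, we first split the MSE of $\hat{\mathbf{D}}(\mathbf{Y}) = (\hat{\mathbf{d}}_1(\mathbf{Y}),\ldots,\hat{\mathbf{d}}_p(\mathbf{Y}))$ into a sum of the MSE for the individual columns of the dictionary, i.e.,

$$\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\} = \sum_{l\in[p]} \mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2\}. \tag{88}$$

Thus, we may analyze the column-wise MSE $\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2\}$ separately for each column index $l \in [p]$. Note that, by construction,

$$\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2 \le 2, \tag{89}$$

since the columns of $\hat{\mathbf{D}}(\mathbf{Y})$ and $\mathbf{D}$ have norm at most one.

We will analyze the MSE of the DL scheme in Algorithm 1 by conditioning on a specific event $\mathcal{C}$, defined as

$$\mathcal{C} \triangleq \bigcap_{k\in[N],\, l\in[p]} \{|n_{k,l}| \le 0.4\}. \tag{90}$$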
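Assuming $r\sqrt{s} \le 1/10$, the occurrence of $\mathcal{C}$ implies the estimated coefficient matrix $\hat{\mathbf{X}}$ to coincide with the true coefficients $\mathbf{X}$, i.e.,

$$\mathrm{P}\{\hat{\mathbf{X}} = \mathbf{X} \,|\, \mathcal{C}\} = 1. \tag{91}$$

Indeed, by (83) each entry of $\mathbf{\Delta}\mathbf{x}_k$ has magnitude at most $\|\mathbf{\Delta}\|_F\|\mathbf{x}_k\|_2 \le r\sqrt{s}$. Hence, if $r\sqrt{s} \le 1/10$ and $|n_{k,l}| \le 0.4$ for every $k \in [N]$ and $l \in [p]$, then $y_{k,j} > 0.5$ if $x_{k,j} = 1$ (implying $\hat{x}_{k,j} = 1$), and $y_{k,j} < -0.5$ if $x_{k,j} = -1$ (implying $\hat{x}_{k,j} = -1$), as well as $|y_{k,j}| \le 0.5$ if $x_{k,j} = 0$ (implying $\hat{x}_{k,j} = 0$). The characterization of the probability of $\mathcal{C}$ is straightforward, since the noise entries $n_{k,l}$ are assumed to be i.i.d. Gaussian variables with zero mean and variance $\sigma^2$. In particular, the tail bound [47, Proposition 7.5] together with a union bound over all entries of the coefficient matrix $\mathbf{X} = (\mathbf{x}_1,\ldots,\mathbf{x}_N) \in \mathbb{R}^{p\times N}$ yields

$$\mathrm{P}\{\mathcal{C}^c\} \le \exp(-pN\,0.4^2/(2\sigma^2)). \tag{92}$$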

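As a next step we upper bound the MSE using the law of total expectation:

$$\begin{aligned}
\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2\} &= \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\}\,\mathrm{P}(\mathcal{C}) + \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}^c\}\,\mathrm{P}(\mathcal{C}^c) \\
&\stackrel{(89)}{\le} \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\}\,\mathrm{P}(\mathcal{C}) + 2\,\mathrm{P}(\mathcal{C}^c) \\
&\stackrel{(92)}{\le} \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\} + 2\exp(-pN\,0.4^2/(2\sigma^2)).
\end{aligned} \tag{93}$$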


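The conditional MSE $\mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\}$ can be bounded by

$$\begin{aligned}
\mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\} &= \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\mathbf{P}_{\mathcal{B}(1)}\tilde{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\} \\
&\le \mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\tilde{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\} \\
&= \mathrm{E}_{\mathbf{Y},\mathbf{N}}\Big\{\Big\|\frac{p}{Ns}\sum_{k\in[N]} \hat{x}_{k,l}\,\mathbf{y}_k - \mathbf{d}_l\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} \\
&= \mathrm{E}_{\mathbf{Y},\mathbf{X},\mathbf{N}}\Big\{\Big\|\frac{p}{Ns}\sum_{k\in[N]} \hat{x}_{k,l}\,(\mathbf{D}\mathbf{x}_k + \mathbf{n}_k) - \mathbf{d}_l\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} \\
&\stackrel{(a)}{=} \mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\,(\mathbf{D}\mathbf{x}_k + \mathbf{n}_k) - \mathbf{d}_l\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\},
\end{aligned} \tag{94}$$

where step (a) is valid because $\mathrm{P}\{\hat{x}_{k,l} = x_{k,l} \,|\, \mathcal{C}\} = 1$ (cf. (91)). Applying the inequality $\|\mathbf{y} + \mathbf{z}\|_2^2 \le 2(\|\mathbf{y}\|_2^2 + \|\mathbf{z}\|_2^2)$ to (94) yields further

$$\mathrm{E}_{\mathbf{Y},\mathbf{N}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2 \,|\, \mathcal{C}\} \le 2\,\mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\,\mathbf{n}_k\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} + 2\,\mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\mathbf{d}_l - \frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\sum_{t\in[p]} \mathbf{d}_t x_{k,t}\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\}. \tag{95}$$

Our strategy will be to separately bound the two expectations in (95) from above.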

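In order to upper bound $\mathrm{E}_{\mathbf{X},\mathbf{N}}\{\|(p/(Ns))\sum_{k\in[N]} x_{k,l}\,\mathbf{n}_k\|_2^2 \,|\, \mathcal{C}\}$, we note that the conditional distribution $f(n_{k,t}|\mathcal{C})$ of $n_{k,t}$, given the event $\mathcal{C}$, is given by

$$f(n_{k,t}|\mathcal{C}) = \frac{1}{\sqrt{2\pi}\,\sigma\,\big(Q(-0.4/\sigma) - Q(0.4/\sigma)\big)}\,\exp\big(-n_{k,t}^2/(2\sigma^2)\big)\,\mathcal{I}_{[-0.4,0.4]}(n_{k,t}), \tag{96}$$

where $\mathcal{I}_{[-0.4,0.4]}(\cdot)$ is the indicator function for the interval $[-0.4, 0.4]$ and $Q(x) \triangleq \int_{x}^{\infty} (1/\sqrt{2\pi})\exp(-(1/2)z^2)\,dz$ denotes the tail probability of the standard normal distribution. In particular, the conditional variance $\sigma^2_{n_{k,t}|\mathcal{C}}$ can be bounded as

$$\sigma^2_{n_{k,t}|\mathcal{C}} \le \sigma^2/\nu, \quad \text{with } \nu \triangleq Q(-0.4/\sigma) - Q(0.4/\sigma). \tag{97}$$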

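Since, conditioned on $\mathcal{C}$, the variables $x_{k,l}$ and $n_{k,t}$ are independent, we obtain

$$\begin{aligned}
\mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\,\mathbf{n}_k\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} &= \Big(\frac{p}{Ns}\Big)^2 \sum_{k\in[N]}\sum_{t\in[m]} \mathrm{E}_{\mathbf{X},\mathbf{N}}\{x_{k,l}^2 \,|\, \mathcal{C}\}\,\mathrm{E}_{\mathbf{X},\mathbf{N}}\{n_{k,t}^2 \,|\, \mathcal{C}\} \\
&\stackrel{(97)}{\le} \Big(\frac{p}{Ns}\Big)^2 N\,\mathrm{E}_{\mathbf{X},\mathbf{N}}\{x_{k,l}^2 \,|\, \mathcal{C}\}\,m\sigma^2/\nu \\
&\stackrel{(a)}{=} \Big(\frac{p}{Ns}\Big)^2 N\,\mathrm{E}_{\mathbf{X}}\{x_{k,l}^2\}\,m\sigma^2/\nu \\
&\stackrel{(85)}{=} \Big(\frac{p}{Ns}\Big)^2 N\,(s/p)\,m\sigma^2/\nu \\
&\stackrel{(87)}{\le} (p/N)\,(1-r)^2/(\nu\,\mathrm{SNR}),
\end{aligned} \tag{98}$$

where step (a) is due to the fact that $x_{k,l}$ is independent of the event $\mathcal{C}$.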


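As to the second expectation in (95), we first observe that

$$\mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\mathbf{d}_l - \frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\sum_{t\in[p]} \mathbf{d}_t x_{k,t}\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} = \mathrm{E}_{\mathbf{X}}\Big\{\Big\|\mathbf{d}_l - \frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\sum_{t\in[p]} \mathbf{d}_t x_{k,t}\Big\|_2^2\Big\} \tag{99}$$

since the coefficients $x_{k,t}$ are independent of the event $\mathcal{C}$. Next, we expand the squared norm and apply the relations

$$\mathrm{E}_{\mathbf{X}}\{x_{k,l}\,x_{k,t}\,x_{k',l}\,x_{k',t'}\} = \begin{cases} (s/p)^2, & \text{for } k' = k, \text{ and } t = t' \ne l \\ (s/p)^2, & \text{for } k' \ne k, \text{ and } t = t' = l \\ (s/p), & \text{for } k' = k, \text{ and } t = t' = l \\ 0, & \text{else.} \end{cases} \tag{100}$$

A somewhat lengthy calculation reveals that

$$\mathrm{E}_{\mathbf{X}}\Big\{\Big\|\mathbf{d}_l - \frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\sum_{t\in[p]} \mathbf{d}_t x_{k,t}\Big\|_2^2\Big\} = (1/N)(p + p/s - 2) \le 2p/N. \tag{101}$$

Inserting (101) into (99) yields

$$\mathrm{E}_{\mathbf{X},\mathbf{N}}\Big\{\Big\|\mathbf{d}_l - \frac{p}{Ns}\sum_{k\in[N]} x_{k,l}\sum_{t\in[p]} \mathbf{d}_t x_{k,t}\Big\|_2^2 \,\Big|\, \mathcal{C}\Big\} \le 2p/N. \tag{102}$$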


Combining (102) and (98) with (95) and inserting into (93), we finally obtain

$$\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{d}}_l(\mathbf{Y}) - \mathbf{d}_l\|_2^2\} \le 2\big[(p/N)(1-r)^2/(\nu\,\mathrm{SNR}) + 2p/N\big] + 2\exp(-pN\,0.4^2/(2\sigma^2)), \tag{103}$$

and in turn, by summing over all column indices $l \in [p]$ (cf. (88)),

$$\mathrm{E}_{\mathbf{Y}}\{\|\hat{\mathbf{D}}(\mathbf{Y}) - \mathbf{D}\|_F^2\} \le 2\big[(p^2/N)(1-r)^2/(\nu\,\mathrm{SNR}) + 2p^2/N\big] + 2p\exp(-pN\,0.4^2/(2\sigma^2)). \tag{104}$$

The upper bound (26) follows then by noting that $\nu = Q(-0.4/\sigma) - Q(0.4/\sigma) \ge 1/2$ for $\sigma \le 0.4$.

V. Conclusion

By adapting an established information-theoretic approach to minimax estimation, we derived lower bounds on the minimax risk of DL using certain random coefficient models for representing the observations as linear combinations of the columns of an underlying dictionary matrix. These lower bounds on the optimum achievable performance, quantified in terms of worst case MSE, seem to be the first results of their kind for DL. Our first bound applies to a wide range of coefficient distributions, and only requires the existence of the covariance matrix of the coefficient vector. We then specialized this bound to a sparse coefficient model with normally distributed non-zero coefficients. Exploiting the specific structure induced by the sparse coefficient model, we derived a second lower bound which tends to be tighter in the low SNR regime. Our bounds apply to the practically relevant case of overcomplete dictionaries and noisy measurements. An analysis of a simple DL scheme for the low SNR regime reveals that our lower bounds are tight, as they are attained by the worst case MSE of a particular DL scheme. Moreover, for fixed SNR and vanishing sparsity rate, the necessary scaling $N = \Theta(p^2)$ of the sample size $N$ implied by our lower bound matches the sufficient condition (upper bound) on the sample size such that the learning schemes proposed in [7], [26] are successful. Hence, in certain regimes, the DL methods put forward by [7], [26] are essentially optimal in terms of sample size requirements.

Appendix A
Technicalities


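Lemma A.1. Consider the DL problem based on observing the data matrix $\mathbf{Y} = (\mathbf{y}_1,\ldots,\mathbf{y}_N)$ with columns being i.i.d. realizations of the vector $\mathbf{y}$ in (4). We stack the corresponding realizations $\mathbf{x}_k$ of the coefficient vector $\mathbf{x}$ into the matrix $\mathbf{X}$. The true dictionary in (4) is obtained by selecting u.a.r., and statistically independent of the random coefficients $\mathbf{x}_k$, an element of the set $\mathcal{D}_0 = \{\mathbf{D}_1,\ldots,\mathbf{D}_L\}$, i.e., $\mathbf{D} = \mathbf{D}_l$ where the index $l \in [L]$ is drawn u.a.r. from $[L]$. Let $\mathbf{T}(\mathbf{X})$ denote an arbitrary function of the coefficients. Then, the error probability $\mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\}$ of any detector $\hat{l}(\mathbf{Y})$ which is based on observing $\mathbf{Y}$ is lower bounded as

$$\mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\} \ge 1 - \frac{I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) + 1}{\log_2(L)}, \tag{105}$$

where $I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$ denotes the conditional MI between $\mathbf{Y}$ and $l$ given the side information $\mathbf{T}(\mathbf{X})$.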


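Proof: According to Fano's inequality [44, p. 38],

$$\mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\} \ge \frac{H(l|\mathbf{Y}) - 1}{\log_2(L)}. \tag{106}$$

Combining this with the identity [44, p. 21]

$$I(l; \mathbf{Y}) = H(l) - H(l|\mathbf{Y}), \tag{107}$$

and the fact that $H(l) = \log_2(L)$, since $l$ is distributed uniformly over $[L]$, yields

$$\mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\} \ge 1 - \frac{I(l; \mathbf{Y}) + 1}{\log_2(L)}. \tag{108}$$

By the chain rule of MI [44, Ch. 2],

$$\begin{aligned}
I(\mathbf{Y}; l) &= I(\mathbf{Y}, \mathbf{T}(\mathbf{X}); l) - I(l; \mathbf{T}(\mathbf{X}) \,|\, \mathbf{Y}) \\
&= I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) + \underbrace{I(l; \mathbf{T}(\mathbf{X}))}_{=0} - I(l; \mathbf{T}(\mathbf{X}) \,|\, \mathbf{Y}) \\
&= I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) - I(l; \mathbf{T}(\mathbf{X}) \,|\, \mathbf{Y}).
\end{aligned} \tag{109}$$

Here, we used $I(l; \mathbf{T}(\mathbf{X})) = 0$, since the coefficients $\mathbf{X}$ and the index $l$ are independent. Since $I(l; \mathbf{T}(\mathbf{X}) \,|\, \mathbf{Y}) \ge 0$ [44, Ch. 2], we have from (109) that $I(\mathbf{Y}; l) \le I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X}))$. Thus,

$$\mathrm{P}\{\hat{l}(\mathbf{Y}) \ne l\} \stackrel{(108),(109)}{\ge} 1 - \frac{I(\mathbf{Y}; l \,|\, \mathbf{T}(\mathbf{X})) + 1}{\log_2(L)}. \tag{110}$$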
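The bound (105) and its rearrangement used in (78) are easy to evaluate numerically. The following Python sketch assumes the MI is measured in bits; both function names are hypothetical.

```python
import math


def fano_error_lower_bound(mi_bits, L):
    """Lower bound (105) on the error probability of any detector, clipped at 0."""
    if L < 2:
        raise ValueError("need at least two candidate dictionaries")
    return max(0.0, 1.0 - (mi_bits + 1.0) / math.log2(L))


def mi_required(error_prob, L):
    """Rearranged form of (105): the MI must be at least (1 - P_e) log2(L) - 1."""
    return (1.0 - error_prob) * math.log2(L) - 1.0


# Hypothetical example: L = 16 candidate dictionaries.
print(fano_error_lower_bound(mi_bits=1.0, L=16))
print(mi_required(error_prob=0.5, L=16))
```

Setting the error probability to $1/2$ in the second function gives $(1/2)\log_2(L) - 1$, which is the right-hand side of (71).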


We also make use of Hoeffding's inequality [54], which characterizes the large deviations of the sum of i.i.d. and bounded random variables.

Lemma A.2 (Theorem 7.20 in [47]). Let $x_r$, $r \in [k]$, be a sequence of i.i.d. zero mean, bounded random variables, i.e., $|x_r| \le a$, for some constant $a$. Then,